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Cluster  analysis,  which  attempts  to  place  objects  into  reasonable  groups  on 
the  basis  of  statistical  data  measured  on  them,  is  an  important  exploratory  tool 
for  many  scientific  studies.  In  particular,  we  explore  the  problem  of  clustering 
functional  data,  which  arise  as  curves,  characteristically  observed  as  part  of 
a continuous  process.  In  recent  years,  methods  for  smoothing  and  clustering 
functional  data  have  appeared  in  the  statistical  literature,  but  little  work  has 
appeared  specifically  addressing  the  effect  of  smoothing  on  the  cluster  analysis. 
We  discuss  the  purpose  of  cluster  analysis  and  review  some  common  clustering 
methods,  with  attention  given  to  both  deterministic  and  stochastic  methods. 

We  address  functional  data  and  the  related  field  of  smoothing,  and  a measure  of 
dissimilarity  for  functional  data  is  suggested. 

We  examine  the  effect  of  smoothing  functional  data  on  estimating  the  dis- 
similarities among  objects  and  on  clustering  those  objects.  We  prove  that  a 
shrinkage  method  of  smoothing  results  in  a better  estimator  of  the  dissimilarities 
among  a set  of  noisy  curves.  For  a model  having  independent  noise  structure. 


the  smoothed-data  dissimilarity  estimator  dominates  the  observed-data  estima- 
tor. For  a dependent-error  model,  an  asymptotic  domination  result  is  given  for 
the  smoothed-data  estimator.  We  propose  an  objective  function  to  measure  the 
goodness  of  a clustering  for  smoothed  functional  data. 

Simulations  give  strong  empirical  evidence  that  smoothing  functional  data 
before  clustering  results  in  a more  accurate  grouping  than  clustering  the  observed 
data  without  smoothing.  Two  examples,  involving  functional  data  on  yeast  gene 
expression  levels  and  research  library  '‘growth  curves."  illustrate  the  technique. 


CHAPTER  1 

INTRODUCTION  TO  CLUSTER  ANALYSIS 

The  goal  of  cluster  analysis  is  to  find  groups,  or  clusters,  in  data.  The  objects 
in  a data  set  (often  univariate  or  multivariate  observations)  should  be  grouped  so 
that  objects  in  the  same  cluster  are  similar  and  objects  in  different  clusters  are 
dissimilar  (Kaufman  and  Rousseeuw.  1990.  p.  1).  How  to  measure  the  similarity 
of  objects  is  something  that  depends  on  the  application,  yet  is  a fundamental  issue 
in  cluster  analysis.  Sometimes  in  a multivariate  data  set  it  Is  not  the  observations 
that  are  clustered,  but  rather  the  variables  (according  to  some  similarity  measure 
on  the  variables)  and  this  case  is  dealt  with  slightly  differently  (Johnson  and 
Wichern,  1998,  p.  735).  More  often,  though,  it  is  the  objects  that  are  clustered 
according  to  their  observed  values  of  one  or  more  variables,  and  this  introduction 
will  chiefly  focus  on  this  situation. 

The  general  clustering  setup  for  multivariate  data  is  as  follows:  In  a data  set 
there  are  N objects  on  which  are  measured  p variables.  Hence  we  represent  this 

by  N vectors  yi yw  in  S?p.  We  wish  to  group  the  N objects  into  K clusters. 

I < K < N.  Denote  the  possible  clusterings  of  N objects  into  nonempty  groups 
as  C = {ci, . . . ,ca(p(}.  The  number  of  possible  clusterings  B(N)  depends  on  the 
number  of  objects  N and  is  known  as  the  Bell  number  (Sloane  and  Plouff,  1995, 
entry  M4981). 

As  a simple  example,  consider  the  data  in  Table  1-1.  We  wish  to  group  the 
12  objects  (the  European  countries)  based  on  the  values  of  two  variables,  gross 
national  product  (gnp)  and  percent  of  gnp  due  to  agriculture  (agric).  When  the 
data  are  univariate  or  two<dimensional  and  N is  not  too  large,  it  is  often  easy  to 


Table  1-1:  Agricultural  data  far  Europe 


higher  dimensions,  however,  automated  clustering  methods  become  necessary. 

A statistical  field  closely  related  to  cluster  analysis  is  discriminant  analysis, 
which  also  attempts  to  classify  objects  into  groups.  The  main  difference  is  that 
in  discriminant  analysis  there  exists  a training  sample  of  objects  whose  group 
memberships  are  known,  and  the  goal  is  to  use  characteristics  of  the  training 
sample  to  devise  a rule  which  classifies  future  objects  into  the  prespecified  groups. 
In  cluster  analysis,  however,  the  clusters  are  unknown,  in  form  and  often  in 
number.  Thus  cluster  analysis  is  more  exploratory  in  nature,  whereas  discriminant 
analysis  allows  more  precise  statements  about  the  probability  of  making  inferential 
errors  (Gnanadesikan  et  al.,  1989). 

In  contrast  with  discriminant  analysis,  where  the  number  of  groups  and  the 
groups'  definitions  are  known,  cluster  analysis  presents  two  separate  questions: 

How  many  groups  are  there?  And  which  objects  should  be  allocated  to  which 
groups,  i.e..  how  should  the  objects  be  partitioned  into  groups?  It  is  partially 
because  of  the  added  difficulty  of  answering  both  of  these  questions  that  the  field 


Figure  1-1:  A scatter  plot  of  the  agricultural  data. 


of  cluster  analysis  has  not  built  such  an. extensive  and  thorough  theory  as  has 
discriminant  analysis  (Gnanadesikan  et  ah,  1989). 

1.1  The  Objective  Function 

Naturally,  it  is  desirable  that  a clustering  algorithm  have  some  optimal 
property.  We  would  like  a mathematical  criterion  to  measure  how  well-grouped  the 
data  are  at  any  point  in  the  algorithm.  A convenient  way  to  define  such  a criterion 
is  via  an  objective  function,  a real-valued  function  of  the  possible  partitions  of  the 
objects.  Mathematically,  if  C = {ci,  — Cs<jvi } represents  the  space  of  all  possible 
partitions  of  N objects,  a typical  objective  function  is  a mapping  g :C  R+. 

Ideally,  a good  objective  function  g will  increase  (or  decrease,  depending  on 
the  formulation  of  g ) monotonically  as  the  partitions  group  more  similar  objects 
in  the  same  cluster  and  more  dissimilar  objects  in  different  clusters.  Given  a good 
objective  function  g.  the  ideal  algorithm  would  optimize  g,  resulting  in  the  best 
possible  partition. 

When  N is  very  small,  we  might  enumerate  the  possible  partitions cb(n), 
calculate  the  objective  function  for  each,  and  choose  the  c,  with  the  optimal  p(c,). 
However,  B(N)  grows  rapidly  with  .V . For  example,  for  the  European  agriculture 
data.  B(12)  = 4.213.597.  while  B(19)  = 5.832.742,205.057  (Sloane  and  Plonffe, 

1995,  entry  M1484).  For  moderate  to  large  ;Y . this  enumeration  is  infeasible 
(Johnson  and  Wichem.  1998,  p.  727). 

Since  full  enumeration  is  usually  impossible,  clustering  methods  tend  to  be 
algorithms  systematically  designed  to  search  for  good  partitions.  But  such  deter- 
ministic algorithms  cannot  guarantee  the  discovery  of  the  best  overall  partition. 

1.2  Measures  of  Dissimilarity 

A fundamental  question  for  most  deterministic  algorithms  is  which  measure  of 
dissimilarity  (distance)  to  use.  A popular  choice  is  the  Euclidean  distance  between 


object  i and  object  j 


= 1/ tel  - »t)*  + (j/.a  - »«)2  (y.p  - »p)j- 
The  Manhattan  (city-block)  distance  is 


Certain  types  of  data  require  specialized  dissimilarity  measures.  The  Canberra 
metric  and  Czekanowski  coefficient  (see  Johnson  and  Wichern.  1998,  p.  729)  are 
two  dissimilarity  measures  for  nonnegative  variables,  while  Johnson  and  Wichern 
(1998,  p.  733)  give  several  dissimilarity  measures  for  binary  variables. 

Having  chosen  a dissimilarity  measure,  one  can  construct  a N x N (symmetric) 
dissimilarity  matrix  (also  called  the  distance  matrix ) D whose  rows  and  columns 
represent  the  objects  in  the  data  set.  such  that  Dy  = D,,  - d(i,j). 

In  the  following  sections,  some  common  methods  of  cluster  analysis  are 
presented  and  categorized  by  type. 

1.3  Hierarchical  Methods 

Hierarchical  methods  can  be  either  agglomerative  or  divisive.  Kaufman  and 
Rousseeuw  (1990)  compiled  a set  of  methods  which  were  adopted  into  the  cluster 
library  of  the  S-plus  computing  package,  and  often  the  methods  are  referred  to  by 
the  names  Kaufman  and  Rousseeuw  gave  them. 

Agglomerative  methods  begin  with  N clusters;  that  is,  each  observation  forms 
its  own  cluster.  The  algorithm  successively  joins  clusters,  yielding  N - 1 clusters, 
then  N — 2 clusters,  and  so  on  until  there  remains  only  one  cluster  containing  all 
N objects.  The  S-plus  functions  agues  and  hclust  perform  agglomerative  clustering. 
Common  agglomerative  methods  include  linkage  methods  and  Ward's  method. 

All  agglomerative  methods,  at  each  step,  join  the  two  clusters  which  are 
considered  “closest.”  The  difference  among  the  methods  is  how  each  defines 


"closeness."  Each  method,  however,  defines  the  distance  between  two  clusters  using 
some  function  of  the  dissimilarities  among  individual  objects  in  those  dusters. 

Divisive  methods  begin  with  all  objects  in  one  cluster  and  successively  split 
clusters,  resulting  in  partitions  of  1.  2,  3, . . . and  finally  N dusters.  The  S-plus 
function  diana  performs  divisive  analysis. 


1.4  Partitioning  Methods 

While  hierarchical  methods  seek  good  partitions  for  all  K = 1 ,V, 

partitioning  methods  fix  the  number  of  clusters  and  seek  a good  partition  for 
that  specific  K.  Although  the  hierarchical  methods  may  seem  to  be  more  flexible, 
they  have  an  important  disadvantage.  Once  two  clusters  have  been  joined  in  an 
agglomerative  method  (or  split  in  a divisive  method),  this  move  can  never  be 


undone,  although  later  in  the  algorithm  undoing  the  move  might  improve  the 
clustering  criterion  (Kaufman  and  Rousseeuw,  1990,  p.  44).  Hence  hierarchical 
methods  severely  limit  how  much  of  the  partition  space  C can  be  explored.  While 
this  phenomenon  results  in  higher  computational  speed  for  hierarchical  algorithms, 
its  clear  disadvantage  often  necessitates  the  use  of  the  less  rigid  partitioning 
methods  (Kaufman  and  Rousseeuw,  1990.  p.  44). 

In  practice.  Johnson  and  Wichern  (1998.  p.  760)  recommend  running  a 
partitioning  method  for  several  reasonable  choices  of  K and  subjectively  examining 
the  resulting  clusterings.  Finding  an  objective,  data-dependent  way  to  specify  K 
is  an  open  question  that  has  spurred  recent  research.  Rousseeuw  (1987)  proposes 
to  select  K to  maximize  the  average  silhouette  width  s(K).  For  each  object  i,  the 
silhouette  value 


W max{a(i),  6(t)} 

where  a(i)  = average  dissimilarity  of  t to  all  other  objects  in  its  cluster  (say,  cl 


cluster  R.  Then  s(K)  is  the  average  s(i)  over  all  objects  i in  the  data  set. 


Tibshirani  et  al.  (2001a)  propose  a.  “Gap"  statistic  to  choose  K.  In  a separate 
paper,  Tibshirani  et  al.  (2001b)  suggest  treating  the  problem  as  in  model  selection, 
and  choosing  K via  a “prediction  strength"  measure.  Sugar  and  James  (2003) 
suggest  a nonparametric  approach  to  determining  K based  on  the  “distortion," 
a measure  of  within-clustor  variability.  Fraley  and  Raftery  ( 1998)  use  the  Bayes 
Information  Criterion  to  select  the  number  of  clusters.  Milligan  and  Cooper  (1985) 
give  a survey  of  earlier  methods  of  choosing  A'. 

Model-based  clustering  takes  a different  perspective  on  the  problem.  It  as- 
sumes the  data  follow  a mixture  of  K underlying  probability  distributions.  The 
mixture  likelihood  is  then  maximized,  and  the  maximum  likelihood  estimate  of 
the  mixture  parameter  vector  determines  which  objects  belong  to  which  subpop- 
ulations. Fraley  and  Raftery  (2002)  provide  an  extensive  survey  of  model-based 
clustering  methods. 

1.4.1  K-means  Clustering 

Among  the  oldest  and  most  well-known  partitioning  methods  is  K-means 
clustering,  due  to  MacQuoen  ( 1967).  .Vote  that  the  centroid  of  a cluster  is  the 
p-dimensional  mean  of  the  objects  in  that  cluster.  After  the  choice  of  K,  the 
K-means  algorithm  initially  arbitrarily  partitions  the  objects  into  K clusters. 
(Alternatively,  one  can  choose  K centroids  as  an  initial  step.)  One  at  a time,  each 
object  is  moved  to  the  cluster  whose  centroid  is  closest  (usually  Euclidean  distance 
is  used  to  determine  this).  When  an  object  is  moved,  centroids  are  immediately 
recalculated  for  the  cluster  gaining  the  object  and  the  cluster  losing  it.  The  method 
repeatedly  cycles  through  the  list  of  objects  until  no  reassignments  of  objects  lake 
place  (Johnson  and  Wichern.  1998.  p.  755). 

A characteristic  of  the  K-means  method  is  that  the  final  clustering  depends 
in  part  on  the  initial  configuration  of  the  objects  (or  initial  specification  of  the 
centroids).  Hence  in  practice,  one  typically  reruns  the  algorithm  from  various 


starting  points  to  monitor  the  stability  of  the  clustering  (Johnson  and  Wichem. 
1998,  p.  755). 

Selim  and  Ismail  (1984)  show  that  K-ineans  clustering  does  not  globally 

£X>?/ m (u) 

where  dtj  denotes  the  Euclidean  distance  between  object  i and  the  centroid  of 
cluster  j,  Le..  dj  = (yi  - yu,)'(yi  - yul).  This  criterion  is  in  essence  an  objective 
function  g(c).  The  K-mcans  solution  may  not  even  locally  minimize  this  objective 
function:  conditions  for  which  it  is  locally  optimal  are  given  by  Selim  and  Ismail 
(1984). 

1.4.2  K-rnedoids  and  Robust  Clustering 

Because  K-mcans  uses  means  (centroids)  and  a least  squares  technique  in 
calculating  distances,  it  is  not  robust  with  respect  to  outlying  observations.  K- 
medoids,  which  is  used  in  the  S-plus  function  pam  (partitioning  around  medoids). 
due  to  Kaufman  and  Rousseeuw  (1987),  has  gained  support  as  a robust  alternative 
to  K-tneans.  Instead  of  minimizing  a sum  of  squared  Euclidean  distances,  K- 
medoids  minimizes  a sum  of  dissimilarities.  Philosophically,  K-medoids  is  to 
K-means  as  least-absolute-residuals  regression  is  to  least-squares  regression. 

Consider  the  objective  function 

(i.2) 

The  algorithm  begins  (in  the  so-called  build-step)  by  selecting  k representa- 
tive objects,  called  medoids . based  on  an  objective  function  involving  a sum  of 
dissimilarities  (see  Kaufman  and  Rousseeuw,  1990,  p.  102).  It  proceeds  by  as- 
signing each  object  « to  the  cluster  j with  the  closest  medoid  mj,  i.e.,  such  that 
<l('i  mi)  S d(i,  m,)  for  all  te  = 1, ....  K,  Next,  in  the  swap-step,  if  swapping 
any  unselected  object  with  a medoid  results  in  the  decrease  of  the  value  of  (1.2), 


the  swap  is  made.  The  algorithm  stops  .when  no  swap  can  decrease  (1.2).  Like 
K-ineans.  K-medoids  does  not  in  general  globally  optimize  its  objective  function 
(Kaufman  and  Roussecuw.  1990.  p.  110). 

Cuesta- Albertos  et  al.  (1997)  propose  another  robust  alternative  to  K-means 
known  as  "trimmed  K-means."  which  chooses  centroids  by  minimizing  an  objective 
function  which  is  based  only  on  an  optimally  chosen  subset  of  the  data.  Robustness 
properties  of  the  trimmed  K-means  method  are  given  by  Garcia-Escudero  and 
Gordaliza  (1999). 

1.5  Stochastic  Methods 

In  general,  the  objective  function  in  a cluster  analysis  can  yield  optima 
which  are  not  global  (Selim  and  Alsultan.  1991).  This  leads  to  a weakness  of  the 
traditional  deterministic  methods.  The  deterministic  algorithms  are  designed 
to  severely  limit  the  number  of  partitions  searched  (in  the  hope  that  “good" 
clusterings  will  quickly  be  found).  If  g is  especially  ill-behaved,  however,  finding  the 
best  clusterings  may  require  a more  wide-ranging  search  of  C.  Stochastic  methods 
are  ideal  tools  for  such  a search  (Robert  and  Casella,  1999,  Section  1.4). 

The  major  advantage  of  a stochastic  method  is  that  it  can  explore  more  of  the 
space  of  partitions.  A stochastic  method  can  be  designed  so  that  it  need  not  always 
improve  the  objective  function  value  at  each  step.  Thus  at  some  steps  the  method 
may  move  to  a partition  with  a poorer  value  of  gt  with  the  benefits  of  exploring 
parts  of  C that  a deterministic  method  would  ignore.  Unlike  "greedy"  deterministic 
algorithms,  stochastic  methods  sacrifice  immediate  gain  for  greater  flexibility  and 
the  promise  of  a potentially  better  objective  function  value  in  another  area  of  C. 

The  major  disadvantage  of  stochastic  cluster  analysis  is  that  it  takes  more 
time  than  the  deterministic  methods,  which  can  usually  be  run  in  a matter  of 
seconds.  At  the  time  most  of  the  traditional  methods  were  proposed,  this  was  an 
insurmountable  obstacle,  but  the  growth  in  computing  power  has  narrowed  the  gap 
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in  recent  years.  Stochastic  methods  that  can  run  in  a reasonable  amount  of  time 
con  be  valuable  additions  to  the  repertoire  of  the  practitioner  of  cluster  analysis. 
Since  cluster  analysis  often  involves  optimizing  an  objective  function  g,  the 
Monte  Carlo  optimization  method  of  simulated  annealing  is  a natural  tool  to  use 
in  clustering.  In  the  context  of  optimization  over  a large,  finite  set,  simulated 
annealing  dates  to  Metropolis  et  al.  (1953),  while  its  modem  incarnation  was 
introduced  by  Kirkpatrick  et  al.  (1983). 

An  example  of  a stochastic  cluster  analysis  algorithm  is  the  simulated  anneal- 
ing algorithm  of  Selim  and  Alsultan  (1991).  which  seeks  to  minimize  the  K-rneans 
objective  function  (1.1).  Colcux  and  Govacrt  (1992)  propose  two  stochastic  cluster- 
ing methods  based  on  the  EM  algorithm. 

1.6  Role  of  the  Dissimilarity  Matrix 
For  many  standard  cluster  analysis  methods,  the  resulting  clustering  structure 
is  determined  by  the  dissimilarity  matrix  (or  distance  matrix)  D (containing 

elements  which  we  henceforth  denote  Sq,  i = 1, . . . , N;  j = 1 N)  for  the  objects. 

With  hierarchical  methods,  this  is  explicit:  If  we  input  a certain  dissimilarity 
matrix  into  a clustering  algorithm,  we  will  get  one  and  only  one  resulting  grouping. 
With  partitioning  methods,  the  fact  is  less  explicit  since  the  result  depends  partly 
on  the  initial  partition,  the  starting  point  of  the  algorithm,  but  this  is  an  artifact 
of  the  imperfect  search  algorithm  (which  can  only  assure  a “locally  optimal'’ 
partition),  not  of  the  clustering  structure  itself.  An  ideal  search  algorithm  which 
could  examine  every  possible  partition  would  always  map  an  inputted  dissimilarity 
matrix  to  a unique  final  clustering. 

Consider  the  two  most  common  criterion-based  partitioning  methods,  K- 
medoids  (the  Splus  function  pom)  and  K-means.  For  both,  the  objective  function 
is  a function  of  the  pairwise  dissimilarities  among  the  objects.  The  K-medoids 
objective  function  simply  involves  a sum  of  elements  of  D.  With  K-means,  the 
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connection  involves  a complicated  recursive  formula,  as  indicated  by  Gordon 
(1981.  p.  42).  (Because  the  connection  is  so  complicated,  in  practice.  K-means 
algorithms  accept  the  data  matrix  as  input,  but  theoretically  they  could  accept  the 
dissimilarity  matrix  - it  would  just  slow  down  the  computational  time  severely.) 
The  result,  then,  is  that  for  these  methods,  a specified  dissimilarity  matrix  yields  a 
unique  final  clustering,  meaning  that  for  the  purpose  of  cluster  analysis,  knowing  D 
is  as  good  as  knowing  the  complete  data. 

If  the  observed  data  have  random  variation,  and  hence  the  measurements  on 
the  objects  contain  error,  then  the  distances  between  pairs  of  objects  will  have 
error.  If  we  want  our  algorithm  to  produce  a clustering  result  that  is  close  to  the 
“true"  clustering  structure,  it  seems  desirable  that  the  dissimilarity  matrix  we  use 
reflect  as  closely  as  possible  the  (unknown)  pairwise  dissimilarities  between  the 
underlying  systematic  components  of  the  data. 

It  is  intuitive  that  if  the  dissimilarities  in  the  observed  distance  matrix  are 
near  the  "truth."  then  the  resulting  clustering  structure  should  be  near  the  true 
structure,  and  a small  computer  example  helps  to  show  this.  We  generate  a sample 
of  60  3-dimensional  normal  random  variables  (with  covariance  matrix  I)  such  that 
15  observations  have  mean  vector  (1,3,1)'.  15  have  mean  (10,6,4)',  15  have  mean 
(1,10.2)',  and  15  have  mean  (5,1,10)'.  These  means  are  well-separated  enough 
that  the  data  naturally  form  four  clusters,  and  the  true  clustering  is  obvious. 

Then  for  100  iterations  we  perturb  the  data  with  random  iV(0,<7J)  noise  having 
varying  values  of  o.  For  each  iteration,  we  compute  the  dissimilarities  and  input 
the  dissimilarity  matrix  of  the  perturbed  data  into  the  K-medoids  algorithm  and 
obtain  a resulting  clustering. 

Figure  1-2  plots,  for  each  perturbed  dataset,  the  mean  (across  elements) 
squared  discrepancy  from  the  true  dissimilarity  matrix  against  the  proportion  of 
all  possible  pairs  of  objects  which  are  correctly  matched  in  the  clustering  resulting 
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Figure  1 -2:  Proportion  of  pairs  of  objects  correctly  grouped  vs.  MSE  of  dissimilari- 

from  that  perturbed  matrix.  (A  correct  match  for  two  objects  means  correctly 
putting  the  two  objects  in  the  same  cluster  or  correctly  putting  the  two  objects 
in  different  clusters,  depending  on  the  "truth.")  This  proportion  serves  as  a 
measure  of  concordance  between  the  clustering  of  the  perturbed  data  set  and  the 
underlying  clustering  structure.  We  expect  that  as  mean  squared  discrepancy 
among  dissimilarities  increases,  the  proportion  of  pairs  correctly  clustered  will 
decrease,  and  the  plot  indicates  this  negative  association.  This  indicates  that  a 
better  estimate  of  the  pairwise  dissimilarities  among  the  data  tends  to  yield  a 
better  estimate  of  the  true  clustering  structure. 

While  this  thesis  focuses  on  cluster  analysis,  several  other  statistical  methods 
are  typically  based  on  pairwise  dissimilarities  among  data.  Examples  include 
multidimensional  scaling  (Young  and  Hamer,  1987)  and  statistical  matching 
(Rodgers.  1988).  An  improved  estimate  of  pairwise  dissimilarities  would  likely 
benefit  the  results  of  these  methods  as  well. 


CHAPTER  2 

INTRODUCTION  TO  FUNCTIONAL  DATA  AND  SMOOTHING 
2.1  Functional  Data 

Frequently,  the  measurements  on  each  observation  are  connected  by  being  part 
of  a single  underlying  continuous  process  (often,  but  not  always,  a time  process). 
One  example  of  such  data  are  the  growth  records  of  Swiss  boys  (Falkner.  1960), 
discussed  by  Ramsay  and  Silverman  (1997.  p.  2)  in  which  the  measurements  are  the 
heights  of  the  boys  at  29  different  ages.  Ramsay  and  Silverman  (1997)  generally 
label  such  data  as  functional  data,  since  the  underlying  data  are  thought  to  be 
intrinsically  smooth,  continuous  curves  having  domain  T,  which  without  loss  of 
generality  we  take  to  be  [0,T].  The  observed  data  vector  y is  merely  a discretized 
representation  of  the  functional  observation  y(t). 

Functional  data  are  related  to  longitudinal  data,  data  measured  across  time 
which  appear  often  in  biostatistica!  applications.  Typically,  however,  in  functional 
data  analysis,  the  primary'  goal  is  to  discover  something  about  the  smooth  curves 
which  underlie  the  functional  observations,  and  to  analyze  the  entire  set  of  func- 
tional data  (consisting  of  many  curves).  The  term  "functional  data  analysis"  is 
attributed  to  Ramsay  and  Dalzell  (1991),  although  methods  of  analysis  existed 
before  the  term  was  coined. 

2.2  Introduction  to  Smoothing 

When  scientists  observe  data  containing  random  noise,  they  typically  desire 
to  remove  the  random  variation  to  better  understand  the  underlying  process  of 
interest  behind  the  data.  A common  method  used  to  capture  the  underlying  signal 
process  is  smoothing. 
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Scatterplot  smoothing,  or  nonparametric  regression,  may  be  used  generally  for 
paired  data  (t,,y,)  for  which  some  underlying  regression  function  /:  [y,j  = /(£,)  is 
assumed.  But  smoothing  is  particularly  appropriate  for  functional  data,  for  which 
that  functional  relationship  y(t)  between  the  response  and  the  process  on  T is 
inherent  in  the  data. 

One  option,  upon  observing  a functional  measurement  y,  is  to  imagine  the 
unknown  underlying  curve  as  an  interpolant  y(t).  This  results  in  a curve  that  is 
visually  no  smoother  than  the  observed  data,  however.  Typically,  when  functional 
data  are  analyzed,  the  vector  of  measurements  is  converted  to  a curve  via  a 
smoothing  procedure  which  reduces  the  random  variation  in  the  function.  If  we 
wish  to  cluster  functional  data,  it  may  be  advantageous  to  smooth  the  observed 
vector  for  each  object  and  perform  the  cluster  analysis  on  the  smooth  curves  rather 
than  on  the  observed  data.  (Clearly,  this  option  is  inappropriate  for  cross-sectional 
data  such  as  the  European  agricultural  data  of  Chapter  1.) 

Smoothing  data  results  in  an  unavoidable  tradeoff  between  bias  and  variance 
(Sitnonoff,  1996,  p.  15).  The  greater  the  amount  of  smoothing  of  a functional 
measurement,  the  more  its  variance  will  decrease,  but  the  more  biased  it  will 
become  (Simonoff,  1996,  p.  42).  In  cluster  analysis,  we  hope  that  clustering  the 
smoothed  data  (which  contains  reduced  noise)  will  lead  to  smaller  within-cluster 
variability,  since  functional  data  which  truly  belong  to  the  same  cluster  should 
appear  more  similar  when  represented  as  smooth  curves.  This  would  help  make 
the  clustering  structure  of  the  data  more  apparent.  Using  smoothed  data  may 
introduce  a bias,  however,  and  the  bias-variance  tradeoff  could  be  quantified  with  a 
mean  squared  error-type  criterion. 

We  denote  the  “observed"  noisy  curves  to  be  y,(t) Jfjv(f).  The  underlying 

signal  curves  for  this  data  set  are  pi((),  — pw( f).  In  reality  we  observe  these 


curves  at  a fine  grid  of  n points,  t|,  so  that  we  observe  N independent 
vectors,  each  n x 1:  yi, . . . ,y«. 

A possible  model  for  our  noisy  data  is  the  discrete  noise  model : 

y«  =/n((,)  + *0,‘  = ' tf-i  = i "•  (2.1) 

Here,  for  each  i = ! N,  e,,  may  be  considered  independent  for  different 

measurement  points,  having  mean  zero  and  constant  variance  o;. 

Another  possible  model  for  our  noisy  curves  is  the  functional  noise  model: 

ttdti)  = Mtj)  + £<(!,), i = 1 N,j  = 1 n,  (2.2) 

where  e,(f)  is.  for  example,  a stationary  Ornstein-Uhlenbeck  process  with  “pull" 
parameter  3 > 0 and  variability  parameter  of.  This  choice  of  model  implies  that 
the  errors  for  the  ith  discretized  curve  have  variance-covariance  matrix  E,  = ir’f! 
where  n,ra  = (2/9)->exp(-0|f,  - tm|)  (Taylor  et  al..  1994).  Note  that  in  this  case, 
the  noise  process  is  functional  — specifically  Ornstein-Uhlenbeck  — but  we  still 
assume  the  response  data  collected  is  discretized,  and  is  thus  a vector  at  the  level 
of  analysis.  Conceptually,  however,  the  noise  process  is  smooth  and  continuous  in 
(2.2),  as  is  the  signal  process  in  cither  model  (2.1)  or  (2.2). 

Depending  on  the  data  and  sampling  scheme,  either  (2.1)  or  (2.2)  may  be  an 
appropriate  model.  If  the  randomness  in  the  data  arises  from  measurement  error 
which  is  independent  from  one  measurement  to  the  next,  (2.1)  is  more  appropriate. 
Ramsay  and  Silverman  (1997,  p.  42)  suggest  a discrete  noise  model  for  the  Swiss 
growth  data,  in  which  heights  of  boys  are  measured  at  29  separate  ages,  and  in 
which  some  small  measuring  error  (independent  across  measurements)  is  likely  to 
be  present  in  the  recorded  data. 

In  the  case  that  the  variation  of  the  observed  data  from  the  underlying  curve  is 
due  to  an  essentially  continuous  random  process,  model  (2.2)  may  be  appropriate. 
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Data  which  arc  measured  frequently  and  almost  continuously  — for  example,  via 
sophisticated  monitoring  equipment  — may  be  more  likely  to  follow  model  (2.2), 
since  data  measured  closely  across  time  (or  another  domain)  may  more  likely  be 
correlated.  We  will  examine  both  situations. 

We  may  apply  a linear  smoother  to  obtain  the  smoothed  curves  /ii((), ....  hs[t)- 
In  practice,  we  apply  a smoothing  matrix  S to  the  observed  noisy  data  to  obtain  a 
smooth,  called  linear  because  the  smooth  fit  can  be  written  as 

<*i  = Sy„  i = 1, ... , /V 

where  S does  not  depend  on  y,  (Buja  et  al..  1989)  and  we  define  ft,  = (/ff(ti), . . . ./,,((„)) . 
Note  that  as  n -*  oc,  the  vector  /if  begins  to  closeiy  resemble  the  curve  /i,(i)  on 

M- 

(Here  and  subsequently,  when  writing  ‘limit  as  n — » oo,"  we  assume 
ti,...,lR  € [0,  TJ:  that  is.  the  collection  of  points  is  becoming  denser  within  [0,T], 
with  the  maximum  gap  between  any  pair  of  adjacent  points  t,_i,  i = 2, . . . ,n, 
tending  to  0.  Stein  (1995)  calls  this  method  of  taking  the  limit  “fixed-domain 
asymptotics,”  while  Cressie  ( 1993)  calls  it  "infill  asymptotics." ) 

Many  popular  smoothing  methods  (kernel  smoothers,  local  polynomial 
regression,  smoothing  splines)  are  linear.  Note  that  if  a bandwidth  or  smoothing 
parameter  for  these  methods  is  chosen  via  a data-driven  method,  then  technically, 
these  smoothers  become  nonlinear  (Buja  et  al.,  1989). 

We  will  focus  primarily  on  basis  function  smoothing  methods,  in  which  the 
smoothing  matrix  S is  an  orthogonal  projection  (i.e.,  symmetric  and  idempotent). 

These  methods  seek  to  express  the  signal  curve  as  a linear  combination  of  k (<  n) 
specified  basis  functions,  in  which  case  the  rank  of  S is  k.  Examples  of  such 
methods  are  regression  splines,  Fourier  series,  polynomial  regression,  and  some 
types  of  wavelet  smoothers  (Ramsay  and  Silverman,  1997,  pp.  44-50). 


2.3  Dissimilarities  Between  Curves 


If  we  choose  squared  £2  distance  as  our  dissimilarity  metric,  then  denote 
the  dissimilarities  between  the  true,  observed,  and  smoothed  curves  i and  j, 
respectively,  as  follows: 


sv= (2.3) 
*,  = / [»(*)-  (2-4) 

= £\Mt)-Mt)?<U-  . (2-5) 


s.i  = y.  - yj, 

= in  - Hj  = se„. 

If  the  data  follow  the  discrete  noise  model,  fly  ~ N(fly,o;)I)  where  a\  = 
(a?  + Oj).  If  the  data  follow  the  functional  noise  model,  fly  — JV(fly,  Ey)  where 
Ey  = ffyfl  and  o,j  = (of  + o'). 

If  we  observe  the  response  at  points  t| , — („  in  (0,  T),  then  we  may  approxi- 
mate (2.3)-(2.5)  by 


= -flyS'sfly. 


Note  that  dy  ->  6,,  where  the  limit  is  taken  as  n -*  oo,  tt,.  ..,i„  € (0,7|. 

Hence  the  question  of  whether  smoothing  aids  clustering  is,  in  a sense,  closely 
related  to  the  question  of:  When,  for  large  n.  is  a better  estimator  of  dy 

than  is  dy? 

(Note:  In  the  following  chapters,  since  the  pair  of  curves  i and  j is  arbitrary, 
we  shall  suppress  the  ij  subscript  on  fly,  fly,  oj,  and  Ey,  writing  instead  9, 
fl,  a2.  and  E,  understanding  that  we  are  concerned  with  any  particular  pair 

i.ie{  I N),i*j.) 

Some  methods  for  clustering  functional  data  recently  have  been  presented  in 
the  statistical  literature,  .lames  and  Sugar  (2003)  introduce  a model-based  cluster- 
ing method  that  is  especially  useful  when  the  points  measured  along  the  sample 
curves  are  sparse  and  irregular.  Tarpey  and  Kinateder  (2003)  discuss  represent- 
ing each  curve  with  basis  functions  and  clustering  via  K-mcans,  to  estimate  the 
"principal  points"  (underlying  cluster  means)  of  the  data’s  distribution.  Abraham 
et  al.  (2003)  propose  representing  the  curves  via  a B-spline  basis  and  clustering 
the  estimated  coefficients  with  a K-means  algorithm,  and  they  derive  consistency 
results  about  the  convergence  of  the  algorithm.  Tarpey  et  al.  (2003)  apply  methods 
for  clustering  functional  data  to  the  analysis  of  a pharmaceutical  study.  Hastie  et 
al.  (1995)  and  James  and  Hastie  (2001)  discuss  discriminant  analysis  for  functional 

2.5  Summary 

It  seems  intuitive  that  in  the  analysis  of  functional  data,  some  form  of  smooth- 
ing the  observed  data  is  appropriate,  and  some  previous  methods  (James  and 
Sugar,  2003;  Tarpey  and  Kinateder,  2003;  Abraham  et  al.,  2003)  do  involve 
smoothing.  Tarpey  and  Kinateder  (2003,  p.  113)  propose  the  question  of  the  effect 
of  smoothing  on  the  clustering  of  functional  data,  citing  the  need  “to  study  the 


sensitivity  of  clustering  methods  on  the  degree  of  smoothing  used  to  estimate  the 
functions." 

fn  this  thesis,  we  will  provide  some  rigorous  theoretical  justification  that 
smoothing  the  data  before  clustering  will  improve  the  cluster  analysis.  Much  of 
the  theory  will  focus  on  the  estimation  of  the  underlying  dissimilarities  among 
the  curves,  and  we  will  show  that  a shrinkage-type  smoothing  method  leads  to  an 
improved  risk  in  estimating  the  dissimilarities.  A simulation  study  will  demonstrate 
that  this  risk  improvement  is  accompanied  by  a clearly  improved  performance  in 
correctly  grouping  objects  into  their  proper  clusters. 


CHAPTER  3 

CASE  t DATA  FOLLOWING  THE  DISCRETE  NOISE  MODEL 

First  we  will  consider  functional  data  following  model  (2.1).  Recall  that  we 
assume  the  response  is  measured  at  n discrete  points  in  [0,  T]. 

3.1  Comparing  MSEs  of  Dissimilarity  Estimators  when  0 is  in  the 
Linear  Subspace  Defined  by  S 

We  assume  our  linear  smoothing  matrix  S is  symmetric  and  idempotent.  For  a 
linear  basis  function  smoother  which  is  fitted  via  least  squares.  S will  be  symmetric 
and  idempotent  as  long  as  the  n points  at  which  fa  is  evaluated  are  identical  to  the 
points  at  which  y,  is  observed  (Ramsay  and  Silverman.  1997,  p.  44).  Examples  of 
such  smoothers  arc  regression  splines  (in  particular.  B-splines).  wavelet  bases,  and 
Fourier  series  bases  (Ramsay  and  Silverman.  1997).  Regression  splines  and  B-spline 
bases  arc  discussed  in  detail  by  de  Boor  (1978)  and  Eubank  (1988,  Chapter  7). 

We  also  assume  that  S projects  the  observed  data  onto  a lower-dimensional 
space  (of  dimension  k < n),  and  thus  r(S)  = fr(S)  = k.  Note  that  S is  a shrinking 
smoother,  since  all  its  singular  values  are  < 1 (Buja  et  al.,  1989).  Recall  that 
according  to  the  discrete  noise  model  for  the  data.  0 ~ N(0, 0*1).  Without  loss 
of  generality,  let  o2!  = I.  (Otherwise,  we  can  let,  for  example,  rj  = o~l8  and 
q = o~'0  and  work  with  q and  q instead.) 

Note  that  ~O0  represents  the  approximate  Lg  distance  between  observed 
curves  |/*(f)  and  j/j(t)  and  ^&  S‘S8  — „0‘S0  represents  the  approximate  La 
distance  between  smoothed  curves  fa(t)  and  /i,(<). 

We  wish  to  see  when  the  "smoothed-data  dissimilarity”  better  estimates 
the  true  dissimilarity  <S,j  between  curves  l‘,(t)  and  Hj(t)  than  the  observed-data 
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dissimilarity.  The  risk  of  an  estimator  f.  for  r is  given  by  R(r,  t)  = £[I(r,  v)j 
where  £(•)  is  a loss  Junction  (see  Lehmann  and  Casella.  1998,  pp.  4-5). 

For  the  familiar  case  of  squared  error  loss  L{t,  = — f)2,  the  risk  is  simply 

the  mean  squared  error  (MSE)  of  the  estimator.  Hence  we  may  compare  the  MSEs 
of  two  competing  estimators  and  choose  the  one  with  the  smaller  MSE.  To  this 
end,  let  us  examine  A/S£(£0'Sfl)  and  MSECS’ 6)  in  estimating  JO'fl  (which 
approaches  tfy  as  n — ► oo). 

In  this  section,  we  consider  the  case  when  0 lies  in  the  linear  subspace  that  S 
projects  onto,  i.e„  SO  = 0.  Note  that  if  two  arbitrary  (discretized)  signal  curves  m 
and  nf  are  in  this  linear  subspace,  then  the  corresponding  0 is  also  in  the  subspace. 


0 = in  - Hj  = S/ij  - Snj  = S(#»j  - Hj)  = SO. 

In  this  idealized  situation,  a straightforward  comparison  of  MSEs  shows  that 
the  smoothed-data  estimator  improves  on  the  observed-data  estimator. 

Theorem  3.1  Suppose  the  observed  0 " S{0,a2 1).  Let  S be  a symmetric  and 
idempatent  tinea r smoothing  matrix  of  rank  k.  IJO  lies  in  the  linear  subspace 
defined  by  S,  then  the  dissimilarity  estimator  gO’SO  has  smaller  mean  squared 
error  than  does  ~6  0 in  estimating  g0*0.  That  is,  the  dissimilarity  estimator  based 
on  the  smooth  is  better  than  the  one  based  on  the  observed  data  in  estimating  the 
dissimilarities  between  the  underlying  signal  curves. 

Proof  of  Theorem  S.l: 

Let  a2  = 1 without  loss  of  generality. 

Now,  E\l0'6\  = lE\B'B\  = £[n  + 0'0]  = T + Jfl’O. 


Similarly,  £[*e'S9]  = ^(9'Sfl]  = |[fc  + 0'SOj  -»  + JtfSS. 

"ss(H  - 

= ^|2n  + 49'ej  + jT+^e'e-^9'ej 
- 

= J*(£+^il®ll,  + l)-  (3.i) 

"“(»  - '{[;**- ;»'»]■} 

- -e»H-M-W 

,,@+>,),{f[i+l|B.IP-IM.]}- 

+ (118911’ -Nell1)’]} 

- '*(?  + 5»S»II*  + 5)  + S[«|8»ll’  - 11*11’)] 

+ 5[(l|s»ll’-ne|l’4  »S 

onto,  implying  that  S9  = 6,  than  MSE{U‘S6)  < MSECS' 6)  since  k < n, 
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On  the  other  hand,  suppose  the  smooth  does  not  reproduce  0 perfectly  and 
that  S0  ^ 0.  Then  it  can  be  shown  (see  Appendix  A.1}  that  the  smoothed-data 
estimator  is  better  when: 

ilSOH1  - II9II3  > — 2 — * - v'4  + 2B  + n»  + 2*.  (3.3) 

Now,  since  S is  a shrinking  smoother,  this  means  ||Sy||  < ||y||  for  all  y,  and  hence 
IlSyll3  < ||y||!  for  all  y.  Therefore,  ||S0||2  < ||0||2  and  the  left  hand  side  of  (3.3)  is 
negative  and  so  is  the  right  hand  side,  If  0 is  such  that  S0  a 0 and  ||S0||2  - ||0||2 
is  near  0,  then  (3.3)  will  be  satisfied.  If,  however,  ||S0||2  - ||0||2  « 0,  then  (3.3) 
will  not  be  satisfied  and  smoothing  will  not  help. 

In  other  words,  some  shrinkage  smoothing  of  the  observed  curves  makes  the 
dissimilarity  estimator  better,  but  too  much  shrinkage  leads  to  a forfeiture  of  that 
advantage.  The  disadvantage  of  the  linear  smoother  is  that  it  cannot  "learn” 
from  the  data  how  much  to  shrink  0.  To  improve  the  smoother,  we  can  employ  a 
James-Stein-type  adjustment  to  S,  so  that  the  data  can  determine  the  amount  of 
shrinkage, 

3.2  A James-Stein  Shrinkage  Adjustment  to  the  Smoother 

What  is  now  known  as  "shrinkage  estimation"  or  “Stein  estimation"  originated 
with  the  work  of  Stein  in  the  context  of  estimating  a multivariate  normal  mean. 
Stein  (1956)  showed  that  the  usual  estimator  (e.g.,  the  sample  mean  vector)  was 
inadmissible  when  the  data  were  of  dimension  3 or  greater.  James  and  Stein  (1961) 
showed  that  a particular  “shrinkage  estimator"  (so  named  because  it  shrunk  the 
usual  estimate  toward  the  origin  or  some  appropriate  point)  dominated  the  usual 
estimator.  In  subsequent  years,  many  results  have  been  derived  about  shrinkage 
estimation  in  a variety  of  contexts.  A detailed  discussion  can  be  found  in  Lehmanu 
and  Casella  (1998,  Chapter  5). 
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Lehmann  and  Casella  ( 1998.  p.  367)  discuss  shrinking  an  estimator  toward  a 
linear  subspace  of  the  parameter  space.  In  our  case,  we  believe  that  0 is  near  S0. 
Let  Cs  = {0  : S0  = 0}  where  S is  symmetric  and  idempotent  of  rank  k.  Hence  we 
may  shrink  6 toward  SO,  the  MLE  of  0 e Cs- 

In  this  case,  a James-Stein  estimator  of  0 (see  Lehmann  and  Casella.  1998, 
p.  367)  is 


where  a is  a constant  and  1 1 • 1 1 is  the  usual  Euclidean  norm. 

In  practice,  to  avoid  the  problem  of  the  shrinkage  factor  possibly  being 
negative  for  small  ||0  - S8||!.  we  will  use  the  positive-part  James-Stein  estimator 


where  i*  = xl(x  > 0). 

Casella  and  Hwang  (1987)  propose  similar  shrinkage  estimators  in  the  context 


unbiased  estimator  toward  a possibly  biased  estimator  using  a James-Stein  form 
can  result  in  a risk  improvement. 

The  shrinkage  estimator  involves  the  data  by  giving  more  weight  to  0 when 
II®  - sell2  is  large  and  more  weight  to  Sfl  when  ||0  - S0||J  is  small.  In  fact,  if  the 
smoother  is  at  all  well-chosen,  SO  is  often  close  enough  to  0 that  the  shrinkage  fac- 
tor in  (3.4)  is  very  often  zero.  The  shrinkage  factor  is  actually  merely  a safeguard 
against  oversmoothing,  in  case  S smooths  the  curves  beyond  what,  in  reality,  it 


(3.4) 


of  confidence  sets  for  a multivariatr 
also  in  the  context  of  estimating  a i 


Green  and  Strawderman  (1991), 
san,  discuss  how  shrinking  an 


So  an  appropriate  shrinkage  i 


of  dij  = ;9'0  is 


d\f]  = Z§(Jsyg(js) 

(!i*(‘-iiraF),s-s4 

Now,  the  risk  difference  (difference  in  MSEs)  between  the  Jomes-Stein 
smoothed  dissimilarity  dljf5’  and  the  observed  dissimilarity  d,j  is: 

= e j - e ^e'e  - ^e'a'j  j . (3,7) 

Since  we  are  interested  in  when  the  risk  difference  is  negative,  we  can  ignore  the 
positive  constant  multiplier  T®/n5. 

From  (3-7), 

A = - 20'00|JS>'0<-'S)  - (0'0)2  + 20'00'0] 

= E[(0<JS>'0<-'S>)2  - (0'0)2  - 20'0(0<JS>'0<JS>  - 0'0)]. 


(3.5) 

(3.6) 


g(jsy g(JS)  _ g'g  _ _O0>(0)'0  + 0V(0)'<#(0)0 

= -20'4^ 
lie- son* 

+ 0'(I-S)'  . ° . — ^-2- 

110  _ cone  HA  . c 


e’(i-s)'(i-s)e  (0-ss||* 

g'(I-S)0  3 Q'(I-S)8 


ff(i-s)6  ||s-sfi||3e'(i-s)# 

= — 2o+  . a . 

lie -se||= 

Hence  the  (scaled)  risk  difference  is 


e'e  = e'(i-s)e  + fl'se 
e’e  = e'(i  - s>e  + e'se. 

Define  the  following  quadratic  forms: 

q,  = 0'(I-S)e 


q,  = 0'(I-S)0 


Note  that  q,  ~ »5Lt(qi),  qi  ~ Xt(qi)  and  they  are  independent  (S  idempotent 
implies  (I  - S)S  = 0 and  6 is  normally  distributed).  Now  write  A as: 


1 = £[(|9i+«|-2o  + ^ -[$i+«s]j- 
= E^-2a  + ^(q,+q,l+ ^-2o  + ^- 


- 2(qi  + qj)  ^-2o  + j J 

^ + •Mqi  + qa)  - ^-(qi  + qa)J 


-¥«H 

= £[-4a[i1+«]  + 4oJ+(gJ-^  + 4o(,I+?1) 

*¥***-.-4 

£($.+*]  = n4-*  + „, 


m > 0,  £[(?,)-]  is. 


where  xl-t  is  a central  with  n-k  degrees  of  freedom.  So  by  replacing  q,  with 
Xn-k  >n  (3-8).  we  obtain  an  upper  bound  Ay  for  A: 

i,  - .»■  - <»  * s[j£j,]  «[£]+■*■('-  ^i)'  M 

Note  that  ■E|(xi!_i)-1|  = l/(n  - * - 2)  and  £((xL.)‘2l  = 1 /(»  - k - 2)(n  - k - 4). 
So  taking  expectations,  we  have: 


- 5-4-2II,- t-e-'-rri-’*  (^fri +«)*’-<-■ 


We  are  looking  for  values  of  a which  make  A u negative.  The  fourth-degree 
equation  Aj;(a)  - o can  be  solved  analytically.  (Two  of  the  roots  are  imaginary,  as 
noted  in  Appendix  B.l.)  One  real  root  is  clearly  0:  call  the  second  real  root  r. 
Theorem  3.2  l/n  — k>  4,  then  the  nontrivial  real  root  r oj  A via)  = 0 is  positive. 
Furthermore,  for  any  choice  a € (0,r),  the  upper  bound  An  < 0,  implying  A < 0. 
That  is,  forO  < a < r,  the  risk  difference  is  negative  and  the  smoothed-data 
dissimilarity  estimator  djfS)  is  better  than  d,,. 

Proof  of  Theorem  3.2:  Write  A„(a)  = 0 as  cjo4  + Cja3  + eja=  + c,a  = 0,  where 
C4,  C3,es,  ci  are  the  respective  coefficients  of  A0(a).  It  is  clear  that  if  n-  k > 4, 
then  cr  > 0,ci  < 0,cj  > O.Ci  < 0. 

Note  that  r also  solves  the  cubic  equation  /e(a)  = iqa3  + c3as  + C20  + ci  = 0. 
Since  its  leading  coefficient  c,  > 0, 

Jim^/efa)  = -00,  lim  /,( a)  = 00. 

Since  !,{a)  has  only  one  real  root,  it  crosses  the  o-axis  only  once.  Since  its  vertical 
intercept  C|  < 0,  its  horizontal  intercept  r > 0. 


Table  3-1:  Table  of  choices  of  a for  various  n and  k. 


A (/{a)  has  leading  coefficient  <■,  > 0, 


Jirn^  A,/(a)  - oo. 

Since  it  tends  to  oo  at  its  endpoints,  A u(a)  must  be  negative  between  its  two  real 
roots  0 and  r.  Therefore  A < 0 for  0 < a < r.  O 

Using  a symbolic  algebra  software  program  (such  as  Maple  or  Mathematical, 
one  can  easily  obtain  the  formula  for  the  second  real  root  for  general  n and  k 
(the  formula  is  given  in  Appendix  B.l)  and  verify  that  the  other  two  roots  are 
imaginary. 

Figure  3-1  shows  A u plotted  as  a function  of  a for  varying  n and  * = 5. 

For  various  choices  of  n (and  k = 5)  Table  3-1  provides  values  of  r,  as  well  as 
the  value  of  a which  minimises  the  upper  bound  for  A.  Fbr  0 < a < r,  the  risk 
difference  is  assured  of  being  negative.  For  a‘,  A v is  minimized. 

Since  Au  provides  an  upper  bound  for  the  (scaled)  risk  difference,  it  may 


estimate  this  discrepancy  via  Monte  Carlo  simulation.  We  generate  a large  number 
of  random  variables  having  the  distribution  of  q,  (namely  xf-M)  and  get  an 
estimate  of  A using  a Monte  Carlo  mean.  For  various  values  of  q,.  Figure  3-2 
shows  A plotted  alongside  Ay. 

3.3  Extension  to  the  Case  of  Unknown  a2 
In  the  previous  section,  o2  was  assumed  to  be  known.  We  now  examine  the 
situation  in  which  8 ~ N{8,  a1!)  with  a1  unknown. 
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Figure  3-1:  Plot  of  Av  against  a for  varying  n and  for  k = 5. 
Solid  line:  n = 20.  Dashed  line:  n = 50.  Dotted  lino:  n = 100. 
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Figure  3-2:  Plot  of  simulated  A and  A « against  n for  n = 20,  k = 5. 

Solid  line:  Upper  bound  A y.  Dashed  line:  simulated  A,  qy  = 0.  Dotted  line: 
simulated  A,  qy  = 135.  Long-dashed  line:  simulated  A,  fji  — 540.  Dot-dashed  line: 
simulated  A,  q,  = 3750000. 


Suppose  there  exists  a random  variable  S 2.  independent  of  0,  such  that 


Then  let  a2  = SP/v. 

Consider  the  following  definition  of  the  James-Stein  estimator  which  accounts 

s”  ' si+(‘-l5rafr)|I-s,< 

Note  that  the  .lames-Stein  estimator  from  Section  3.2  (when  we  assumed  a 
known  covariance  matrix  I)  is  simply  (3.10)  with  a2  = l. 

Now,  replacing  a2  with  the  estimate  a1,  define 

eiJS,  = sfl+  fi~^VI-S)e, 

letting  q,  = ^(I  - S)fl  as  in  Section  3.2.  Write 


where 


Define  the  James-Stein  smoothed-data  dissimilarity  estimator  based  on  0^JS* 
be: 


Then  analogously  to  (3.7). 


Now,  elJSyglJS>  = e'e  - 2^0  + «'*.  So 

a4  = - e'e  - fy'S+t'tf  - ie'g-  e'e)2j 

= £^(0'0  - 0'0)  - (2^‘ fl  - -(fl'«-0'0): 

= E p2(2«'S  - «'«)(0'e  - 0'0)  + (2(40  - «»’! . 


«>'*  = - S)'(I  - S)0  = 

So 

a,  = £^-2(200’  - ^)(9'e  - e'e)  + (200’  - j. 

Since  a and  0 are  independent, 

£I-4od!(0'0  - 0'0|]  = E[-4ad!]£(0‘0  - 0'0]  = E[-W]o!n 


e[^(0'0-0'0)] 

■ e[^(0-(I  - S,0  + 0'S0  - 0’(I  - S,0  - 0'S0)J 

■ B|2oV)e[|] +e[^(* -«,-»)]. 


by  the  independence  of  a and  0 (and  thus  a and  qi). 


= *poP^(*  + ^*)*|I/«  + J!p^ft -»-*)].  (3.12) 


in  * and  q,  is 


(3.12)  < S(2a^](ft+a’*)E(l/A]  + s[^]B(4,-,1-,a] 

= E[2aV](« + n3t)E|l/i,]  + £[2aV]E(l/^|(aJ(n  -*)-*) 


* s )■] 

. H-M V.  + S[!.-S‘1S  [I]  A + *[*,•»<  - ^ " ] . 


E[aa2]  = no3 

£(oV|  = 

E[aV]  = 

nn . „v(!fca#^ifc ts 


)■ 


4,  < -w.,+4.v.(^)E[i]^v(*ya) 


DeBne  « = q>/o*  = («/*)’( I - S)(0/<r).  Since  B ~ W(0,oJI),  (tf/n)  ~ AT«,I) 
where  C = S/a.  Then  the  risk  difference  is  a function  of  (,  and  the  distribution  of 


(I  - S)I  is  iderapotent,  « is  noncentral  x'lk«'(J  - S)<).  Recall 


= -too  + 2aJn  E{(£_tr']  + 4nJ 

-v(*±*±i>)e[(i.,r.l 
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,(v(v  + 2)[v  + i)\  a2 
V ^ )n-k-2 

/i'(t'  + 2)(y  + 4)(i/  + 6)\ a* 

V v<  ) (n-k-  2)(n  — t — 4) 

_ / u(v  + 2)(a  + i){u  + 6)  \ , .4y(y  + 2)(i/  + 4)\  3 

\i^(n-fc-2)(B-ft-4)/0  iA(n-fc-2)  )a 
( 2a(i/  + 2)  4k(i/  + 2)\  s 

+ W(„-i-2)+  *2  )a 

Note  that  if  A u (which  does  not  involve  a)  is  less  than  0,  then  As  < 0,  which  is 
what  we  wish  to  prove. 

Theorem  3.3  Suppose  that  0 ~ N(0,o2l)  with  a2  unknown  and  that  there  exists 
a random  variable  S2.  independent  oj  §,  such  that  S2/o2  ~ xl,  and  let  a2  = S2/ v. 
Ifn  — k > 4.  then  the  nontrivial  real  root  r of  A (/(e)  — 0 is  positive.  Furthermore, 
for  any  choice  a 6 (0,  r),  the  upper  bound  Ay  < 0.  implying  A < 0.  That  is, 
for  0 < a < r,  the  risk  difference  is  negative  and  the  smoothed-data  dissimilarity 
estimator  dJjJJ1  is  better  than  d,,. 

Proof  of  Theorem  3.3:  Note  that  A y(a)  = 0 may  be  written  as  cju1  + c3a3  + 
caa2  + c\a  = 0.  with  c<  > 0,c3  < 0.  C2  > 0.  C(  < 0.  The  proof  then  follows  exactly  as 
the  proof  of  Theorem  3.2  in  Section  3.2.  □ 

3.4  A Bayes  Result:  and  a Limit  of  Bayes  Estimators 

In  this  section,  we  again  assume  S,  the  smoothing  matrix,  to  be  symmetric  and 
idempotent.  We  again  assume  0 ~ N(0,  o2 1)  and  again,  without  loss  of  generality, 
let  o2l  = I.  (Otherwise,  we  can  let,  for  example,  rj  = a 'B  and  i\  = o~'0  and 
work  with  »j  and  r\  instead.)  Also,  assume  that  0 lies  in  the  linear  subspace  (of 
dimension  k < n)  that  S projects  onto,  and  hence  SO  = 0.  We  will  split  f{O\0) 
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into  a part  involving  (I  - S )9  and  a part  involving  SO 


/($!»)«  axp(-(l/2)(9- »)'(»- 9)| 

= exp[-(l/2)(9-9)'(I-S  + S)(9-9)) 

= exp{-(l/2)[d'(I  - S)9  + (9  - 9)S(9  - 9)]} 


which  has  a part  involving  (I  - S)9  and  a part  involving  S9. 

Since  S is  symmetric,  then  for  some  orthogonal  matrix  P,  PSP'  is  diagonal. 
Since  S is  idempotent,  its  eigenvalues  are  0 and  1 . so  PSP  = B where 


Let  p',,  — p'„  be  the  rows  of  P.  Then  9"  = P9  = (p,9. . . . . pk9. (I. . . .,0)'. 
And  let  9 be  the  vector  containing  the  k nonzero  elements  of  9‘.  Similarly, 

9-  = P9. 

Now  we  consider  the  likelihood  for  9: 

L(9|9)  oc  exp{-(l/2)[9‘(I  - S)9  + (9  — 9)S(9  — 9)|} 

« exp{-(l/2)[($  - 9)  S(9  - 9)|} 

= exp{-(l/2)[(d  - 9)  P'PSP'P(9  - 9)]} 

= exp{-(l/2)[(9*  - 9-)'PSP'(9-  - 9-)]} 

= exp{-(l/2)[(«-  - 9-)'B(9-  - 9-))}. 

Now  we  put  a iV(0,  V)  prior  on 


x(9-)aexp[-(l/2)9-'V-9-|, 


lofflV 


of  0.  Note  that  this  I 


W|(B  + V-)-BP#,(B  + V-)-]. 

Thus  the  posterior  expectation  of  is 

*(H  - -*(:•"*) 

= + V-)-(B  + V-)-B$-  + tr  (|l(B  + V)") 

. IfPB(B  + VT(B*V-,-Br«  + *0+V-|-) 

= ^9,SP'(B  + V-)-(B  + V-)-PS9  + tr^(B  + V-)-). 
Let  us  choose  the  prior  variance  so  that  V = JB.  Then 

*(H  - i 

• fs'sp'(rrrB)(^TB)ps“+''(KsiiB)) 


mi 


and  since  P'BP  = S, 


w = l0'sse  in  the  sense  that 


dij  = le‘e.  the  approximate  distance  between  curve  i and  curve  j,  which  in  the 


CHAPTER  4 

CASE  II:  DATA  FOLLOWING  THE  FUNCTIONAL  NOISE  MODEL 

Now  we  will  consider  functional  data  following  model  (2.2).  Again,  we  assume 
the  response  is  measured  at  n discrete  points  in  |0,  T],  but  here  we  assume  a 
dependence  among  errors  measured  at  different  points. 

4.1  Comparing  MSEs  of  Dissimilarity  Estimators  when  0 is  in  the 
Linear  Subspace  Defined  by  S 

As  with  Case  I,  we  assume  our  linear  smoothing  matrix  S is  symmetric  and 
idempotent,  with  r(S)  = !r(S)  = k.  Recall  that  according  to  the  functional  noise 
model  for  the  data.  0 — N{0,  E)  where  the  covariance  matrix  E corresponds,  e.g., 
to  a stationary  Ornstein-Uhlenbcck  process. 

Note  that  ~0'0  represents  the  approximate  La  distance  between  observed 
curves  yt(t)  and  y} ( f ) , and  that,  with  au  Ornstein-Uhleubeck-type  error  structure,  it 
approaches  the  exact  distance  as  n -»  oo  in  a fixed-domain  sense. 

Also,  ~0  S'S0  --  1 OS0  represents  the  approximate  L-1  distance  between 
smoothed  curves  p,(l)  and  /ij(f),  and  this  approaches  the  exact  distance  as  n -t  oo. 

We  wish  to  see  when  the  smoothed-data  dissimilarity  better  estimates  the  true 
dissimilarity  Sy  between  curves  p,(()  and  p,(()  than  the  observed-data  dissimilarity. 
To  this  end  let  us  examine  MSE(^O'S0)  and  MSE(^0'  0)  in  estimating  %0'0 
(which  approaches  6y  as  n —r  oo). 

In  this  section,  we  consider  the  case  in  which  0 lies  in  the  linear  subspace 
defined  by  S,  i.e.,  Sfl  = 0.  We  now  present  a theorem  which  generalizes  the  result 
of  Theorem  3.1  to  the  case  of  a general  covariance  matrix. 

Theorem  4.1  Suppose  the  observed  0 iV(0,  E).  Let  S he  a symmetric  and 

idempotent  linear  smoothing  matrix  oj  rank  k.  1/0  lies  in  the  linear  subspace 
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error  than  does  in  i 


■ 

= £ [2«r(Sl)  + -le'Sfl]  + j ~tr(S)  + 1 9’9  - 
= J{ar(EJ)+4fl'E9  + |(r(E)]»}.  (4.1) 

-O)  ■ 

- -(»+{«[>]  -H 

- £[*«  + *>»•]  * {^lr(SB)  + >'»  - f «}' 

= 5{ar[(SSffl  + 49’E9  + [tr(S2))J}.  (4.2) 

Hence,  ,rwe  compare  (4.2)  with  (4.1),  we  must  show  that  tr(SE)  < tr(E)  and 
‘r l(SE)*|  < /r((E)»]  to  complete  the  proof. 

tr(E)  = lr(SE  + (I-S)E) 

= (r(SE)+tr((I-S)E) 


> tr(SE) 


the  last  term  is  the  sum  of  the  squared  elements  of  (I  - S)£1/3. 


tr(£3)  = fr((S£  + (I  - S)£)3] 

= tr((S£)3]  + tr(S£(I  - S)£]  + tr((I  - S)£S£)  + tr[((I  - S)£)3] 

= (r|(S£)3]  + tr[SS£(I  - S)(I  - S)£)  + (r|£(I  - S)(I  - S)£SS] 

+ tr|(I-S)£(I-S)£] 

= trl(SE)3]  + 2tr[S£(I  - S)(I  - S)£S] 

+ tr|£1/3(I  - S)£,/3£,/3(I  - S)£1/3] 

> h-KSE)3] 

since  tr(S£(I  - S)(I  - S)£S)  is  the  sum  of  the  squared  elements  of  (I  - S)£S  and 
the  last  term  is  the  sum  of  the  squared  elements  of  £I/2(I  - S)£*^3. 

Hence  MSE(%0'SB)  < MSEfiffe).  Since  (I  - S)£l/2  (which  has  rank  n-k) 
is  not  the  zero  matrix,  the  inequality  is  strict.  □ 

4.2  Jarnes-Stein  Shrinkage  Estimation  in  the  Functional  Noise  Model 
Now  we  will  consider  functional  data  following  model  (2.2),  and  in  this 
section  we  do  not  assume  6 lies  in  the  subspacc  defined  by  S.  Again,  we  assume 
the  response  is  measured  at  n discrete  points  in  [0,T],  but  here  we  assume  a 
dependence  among  errors  measured  at  different  points. 

In  Section  3,2,  we  obtained  an  exact  upper  bound  for  the  diiference  in  risks 
between  the  smoothed-data  estimator  and  observed-data  estimator.  In  this  section 
we  will  develop  an  asymptotic  (large  n)  upper  bound  for  this  difference  of  risks. 

The  upper  bound  is  asymptotic  here  in  the  sense  that  the  bound  is  valid  for 
sufficiently  large  n,  not  necessarily  that  the  expression  for  the  bound  converges  to  a 
meaningful  limiting  expression  for  infinite  n. 
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Note  that  as  the  number  of  measurement  points  n grows  (within  a fixed 
domain),  assuming  a certain  dependence  across  measurement  points  (e.g.,  a corre- 
lation structure  like  that  of  a stationary  Ornstein-Uhlenbeck  process),  the  observed 

data  y, y.v  closely  resemble  the  pure  obsorved  functions  y:  (f), ....  jfw(0  on 

[0,7].  Therefore  this  result  will  be  most  appropriate  for  situations  with  a large 
number  of  measurements  taken  on  a functional  process,  so  that  the  observed  data 
vector  is  “nearly"  a pure  function. 

As  with  Case  I,  we  assume  our  linear  smoothing  matrix  S is  symmetric  and 
idempotent,  with  r(S)  = tr( S)  = k.  Recall  that  according  to  the  functional  noise 
model  for  the  data,  6 ~ i V(0,  S)  where  £ is,  for  example,  a known  covariance 
matrix  corresponding  to  a stationary  Ornstein-Uhlenbeck  process.  In  this  section, 
we  will  assume  a Gaussian  error  process  whose  covariance  structure  allows  for  the 
possibility  of  dependence  among  the  errors  at  different  measurement  points. 

We  consider  the  same  James-Stein  estimator  of  0 as  in  Section  3.2,  namely 
(in  practice,  again,  we  use  the  positive-part  estimator  8+S>),  and  the  same 
James-Stein  dissimilarity  estimator. 

Recall  the  definitions  of  the  following  quadratic  forms: 

9,  = <?'(I-S  )6 

* = e'se 

91  = e'(i-s)0 

92  = o's  e. 

Unlike  in  Section  3.2  when  we  assumed  an  independent  error  structure,  under 
the  functional  noise  model,  91  and  92  do  not  have  noncentral  x2  distributions,  nor 


: they  independent  in  general. 


In  this  same  way  as  in  Section  3.2.  .we  may  write  A as: 


£1*+*)  = «.+(b  + tr[(I-S)E]  + /.r[SEl 


6.  = -4aW[E)  + 4b’  + E [(^)  - ~ + £{i.+ & - a.  - «)]- 

'[fM^ 


In  our  CM*,  (I  - S)  is  ni 


£1l]  - '[fSis]  • lM%)  ■ i,?*[5-Ssr 


Hence  we  have  a large  n approximation,  or  asymptotic  expression,  for  A: 


'*+tr[(I-S)£] 


q,  + tr(SX) 


2n>(ft  + tr(SS))  _ 1a2q\  _ 2a1q2] 
„+tr[(I-S)£]  q,  q,  J 


+ M(I-S)£] 


2a  + q,  +<h' 


!)]' 


Since  by  Jensen’s  Inequality,  E(l/qi)  > l/E(q\),  we  can  bound  the  last  term  above: 


Denote  the  eigenvalues  of  £,/J(I-S)£l/J  by  ci,...,<V  Since  (I-S)  is  positive 

semidefinitc,  it  is  clear  that  C| are  all  nonnegative.  Note  that  C|,...,Cn  are 

also  the  eigenvalues  of  (I  - S)£,  since  E1/3(I  - S)£1/J  = £l/a(I  - S)££-1/2  and 
(I-S)£  arc  similar  matrices.  Note  that  since  r|£l/2(I-S)E1/s]  = r(I-S)  = n-k, 
£'/!(I  - S)E1/2  has  n - k nonzero  eigenvalues,  and  so  does  (I  - S)£. 

It  is  well  known  (see,  e.g.,  Baldessari,  1967:  Tan,  1977)  that  if  y ~ N[p,  V) 
for  positive  definite,  nonsingular  V,  then  for  symmetric  A,  y' Ay  is  distributed  as  a 
linear  combination  of  independent  noncentral  Xs  random  variables,  the  coefficients 
of  which  are  the  eigenvalues  of  AV. 


,(  <r(S)-2a  \ 

\7,  + (r[(I  — S)£]/ 


Hence  we  have  the  asymptotic  upper  bound  for  A: 


y,  since  * = 9'(I  - S)9.  and  (I-S)Ehaseif 


ci  If  >0  which  is  zero  when  9 = 


for  any  <Sf  > 0, 

V*>0. 

PIWW)  + ■ ■ • + wit®  > *]  > Pfox?  + • • + o»xf  > *1  V i > 0. 
Hence  for  all  x > 0,  />,*,[*  > x]  > P0[qt  > x],  i.e.,  the  distribution  of*  with 
9*0  stochastically  dominates  the  distribution  of  * with  9 = 0.  Then 

BWI(*)-m|  < £o[(*rm] 

for  m = 1,2 

Letting  M,  = £0[(*)-3),  we  have  the  following  asymptotic  upper  bound  for  A 
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£[(EL,  ctX?)-2]-  We  see  that  if  n - * > 4,  then  M,  exists  and  is 

='[(£)’]— =*[(£)!• 
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where  ielnl„  is  the  smallest  and  uw  the  largest  of  the  n-k  nonzero  eigenvalues  of 
(I-S)S. 

As  was  the  case  in  the  discrete  noise  situation,  AJ,(a)  = 0 is  a fourth-degree 
equation  with  two  real  roots,  one  of  which  is  trivially  zero.  Call  the  nontrivial  real 

Theorem  4.2  lfn-k>  4,  then  the  nontrivial  real  rool  r of  AJ,(a)  = 0 is  positive. 
Furthermore,  for  any  choice  a 6 (0,r),  the  asymptotic  upper  bound  Ap  < 0, 
implying  A < 0.  That  is,  for  0 < a < r,  and  for  sufficiently  large  n,  the  risk 
difference  is  negative  and  the  smoothed-data  dissimilarity  estimator  d\jS*  is  better 

Proof  of  Theorem  f.S:  Since  E is  positive  definite,  (r(E)  is  positive.  Since 
Ms  > 0,  we  may  write  An  (a)  = 0 as  tqa*  + c3a3  + cjaJ  + c,a  = 0,  where 
Ct  > 0,cj  < 0.C2  > O.Ci  < 0.  The  proof  then  follows  exactly  from  the  proof  of 
Theorem  3.2  in  Section  3.2.  O 

As  with  the  discrete  noise  situation,  one  can  easily  verify,  using  a symbolic 
algebra  program,  that  two  of  the  roots  are  imaginary,  and  can  determine  the 
nontrivial  real  root  in  terms  of  Ms,  tr  j(I  — S)E],  and  tr(E). 

As  an  example,  let  us  consider  a situation  in  which  we  observe  functional  data 
y,,....y«  measured  at  50  equally  spaced  points  ti , . . . , fso>  one  unit  apart.  Here, 
let  us  assume  the  observations  are  discretized  versions  of  functions  jq(t),  • - • , 
which  contain  (possibly  different)  signal  functions,  plus  a noise  function  arising 
from  an  Omstein-Uhlenbeck  (O-U)  process.  We  can  then  calculate  the  covariance 
matrix  of  each  y,  and  the  covariance  matrix  E of  0.  Under  the  O-U  model, 
tr(E)  = u always,  since  each  diagonal  element  of  S is 

For  example,  suppose  the  O-U  process  has  a1  = 2 and  0=1.  Then  in  this 
example,  tr( E)  = 50.  Suppose  we  choose  S be  the  smoothing  matrix  corresponding 
to  a B-spline  basis  smoother  with  6 knots,  dispersed  evenly  within  the  data.  Then 


we  can  easily  calculate  the  eigenvalues  of  (I  - S)E,  which  are  cu . . . , c„,  and  via 
numerical  or  Monte  Carlo  integration,  we  find  that  A/2  = 0.0012  in  this  case. 

Substituting  these  values  into  Ay  (a),  we  see,  in  the  top  plot  of  Figure  4-1, 
the  asymptotic  upper  bound  plotted  as  a function  of  o.  Also  plotted  is  a simulated 
true  A for  a variety  of  values  (n-vectors)  of  0:  Ox  1.  (0,1,0, 1,0, ,..,0,1)',  and 
(— 1,0, 1,-1, 0, 1,...,  -1,0)',  where  1 is  a n-vector  of  ones.  It  should  be  noted  that 
when  6 is  the  zero  vector,  it  lies  in  the  subspace  defined  by  S,  since  S6  = 0 in  that 
case,  The  other  two  values  of  0 shown  in  this  plot  do  not  lie  in  the  subspace.  In 
any  case,  however,  choosing  the  a that  optimizes  the  upper  bound  guarantees  that 
djys>  lias  smaller  risk  than  dy. 

Also  shown,  in  the  bottom  plot  of  Figure  4-1,  is  the  asymptotic  upper  bound 
for  data  following  the  same  Ornstein-Uhleubeck  model,  except  with  n = 30. 
Although  AJ,  is  a largc-n  upper  bound,  it  appears  to  work  well  for  the  grid  size  of 
30. 

4.3  Extension  to  the  Case  when  a 2 is  Unknown 

In  Section  3.2,  it  was  proved  that  the  smoothed-data  dissimilarity  estimator, 
with  the  shrinkage  adjustment,  dominated  the  observcd*data  dissimilarity  estimator 
in  estimating  the  true  dissimilarities,  when  the  covariance  matrix  of  0 was  o* I, 
a known.  In  Section  3.3,  we  established  the  domination  of  the  smoothed-data 
estimator  for  covariance  matrix  <rsI,  a2  unknown. 

In  Section  4.2,  we  developed  an  analogous  asymptotic  result  for  general  known 
covariance  matrix  E.  In  this  section,  we  extend  the  asymptotic  result  to  the  case 
of  0 having  covariance  matrix  of  the  form  V = <raE,  where  a2  is  unknown  and  E 
is  a known  symmetric,  positive  definite  matrix.  This  encompasses  the  functional 
noise  model  (2.2)  in  which  the  errors  follow  an  Ornstein-Uhlenbeck  process  with 
unknown  a1  and  known  0.  (Of  course,  this  also  includes  the  discrete  noise  model 


Otfuu  for  O-U  Example  In  = s 


I 


Figure  4-1:  Plot  of  asymptotic  upper  bound,  and  simulated  A's,  for  Omstein- 
Uhlenbeck-type  data. 

Solid  line:  plot  of  A;,  against  a for  above  O-U  process.  Dashed  line:  simulated  A, 

“ • 1.  Dotted  line:  simulated  A,  6 = (0,1, 0,1,0,..., 0,1)'.  Dot-dashed  line: 
$ 1.0, 1.  1,0,1,_  . . 1.0)'.  Top:  n = 50.  Bottom:  n = 30. 


V = a2!,  but  this  I 


kB~  N(B,\),  with  V = v-S.  Suppose,  as  in 


B(-4nff3(e'e  - B'B) J = B[-W|£|$’$  - B'B\  = £[-4ud2lo,ir(E) 


x of  it  and  8 (and  thus  d and  q,).  We  use  the  (large  n) 


for  (4.3): 


(4.4) 


i?4  and  * 


■ *"Misr'[s]]+ 


+ £p^]<*r|(I-S)E]. 


a;„  - S[  W^l+^Vlj^llLj  + jfS^A!  (I-S)Ejj 

«[(--?)'] 

. b[-mM,(E|  t J^At n - s)s] + W *“> 


Recall  chat 


**  - '( 

«[.W)  - ^(«fe±Sfe*llfeiS). 


As  in  Section  3.3.  define  « = qja*  = (*/o)'(I  - S )(<?/*).  Here,  since 
6 - W(fl,o3E),  (6/a)  - iV(C,E)  where  < = Oja.  Then  the  risk  difference  is  a 
function  of  C,  and  the  distribution  of?;  is  a function  off.  Then 

* - 


As  shown  in  Section  4.2.  for  ail  * > 0,  Petal'll  > xj  > Po(«i  > at].  Note  that 
this  is  equivalent  to:  For  all  x > 0,  Petal'll  > x]  > > xj.  Then 

£<,«.|(?;)-m]  < £»[(«)""! 

form  = 1.2. 

Let  M;  = £0[(?|T‘]  “d  = Eo[WtTsl-  Then 

^ < -4(r(S)n  -e  2))*/itr|(I  - S)Slaa  + 2| 

Ju(v  + 2)(p  + 4) 


'o(v  + 2)\  tr(SS) 


+<(^).--4(*±ar±«)«.. 

+ ^(t'  + 2)(l/  + 4)(l/  + 6)  j M.a, 

/t/(t/-t-2)(i/  + 4)(./  + 6))  M.a<  _ A ^(i/  + 2)(i/  + 4))  M.a, 

-(2Atf|tr[(I  - S)E)  + + 4))o*  - 4tr(E)a. 


Note  that  if  this  last  expression  (which  does  not  involve  a and  which  we  denote 
simply  as  Ap(a))  is  less  than  0.  then  A?v  < 0.  which  is  what  we  wish  to  prove. 

We  may  repeat  the  argument  from  Section  4.2  in  which  we  showed  that  when 
0 = 0.  q,  ~ c,x;,  replacing  g,  with  q{,  8 with  0/a.  and  0 with  C-  Thus  we 

conclude  that  when  ( = 0.  g',  ~ V"_,  v,tf,  where  r, o„  are  the  eigenvalues  of 

(I-S)E. 

Again,  if  n — k > 4,  then  M/  and  exist  and  are  positive,  as  was  shown  in 
Section  4.2. 

Again.  Ao(a)  = 0 is  a fourth-degree  equation  with  two  real  roots,  one  of  which 
is  trivially  aero.  Call  the  nontrivial  real  root  r. 

Theorem  4.3  Suppose  the  same  conditions  as  Theorem  3.3,  except  generalise 
the  covariance  matrix  of  0 to  be  a2 S,  a2  unknown  and  S known,  symmetric,  and 
positive  definite.  Ifn-k  >4,  then  the  nontrivial  real  root  r of  AJ„(a)  = 0 
is  positive.  Furthermore,  for  any  choice  a 6 (0.  r),  the  asymptotic  upper  bound 


A'su  < 0.  implying  A < 0.  That  is.  for.O  < a < r.  and  lor  sufficiently  large  n. 
the  risk  difference  is  negative  and  the  smoothed-data  dissimilarity  estimator  d'/.f*  is 
better  than  dtJ. 

Proof  of  Theorem  f.3:  Since  £ is  positive  definite.  tr(£)  is  positive.  Since 

> 0 and  Ml  > 0.  we  may  write  Av(a)  = U as  r.,a'  + C3<t3  + cjaJ  + do  = 0. 
where  c«  > O.cj  < O.ca  > O.ci  < 0.  The  proof  is  exactly  the  same  as  the  proof  of 
Theorem  4.2  in  Section  4.2.  □ 

4.4  A Pure  Functional  Analytic  Approach  to  Smoothing 

Suppose  we  have  two  arbitrary  observed  random  curves  yt{t)  and  y;(I),  with 
underlying  signal  curves  /r,(t)  and  fij(l).  This  is  a more  abstract  situation  in  which 
the  responses  themselves  are  not  assumed  to  be  discretized,  but  rather  come  to 
the  analyst  as  "pure"  functions.  This  can  be  viewed  as  a limiting  case  of  the 
functional  noise  model  (2.2)  when  the  number  of  measurement  points  n -*  oo  in  a 
fixed-domain  sense.  Throughout  this  section,  we  assume  the  noise  components  of 
the  observed  curves  are  normally  distributed  for  each  fixed  t.  as,  for  example,  in  an 
Omstein-Uhlenbeck  process. 

Let  C[0.  T]  denote  the  space  of  continuous  functions  of  t on  [0,  T\.  Assume  we 
have  a linear  smoothing  operator  S which  acts  on  the  observed  curves  to  produce 
a smooth:  /i((l)  = Syi(t),  for  example.  The  linear  operator  5 maps  a function 
/ S C(0,r]  to  C[0.r)  as  follows: 

S[f]{t)  = j\(t.,)ns)ds 

(where  K is  some  known  function)  with  the  property  that  S[af  + 0g]  = qS[/|  + 
0S[s]  for  constants  a.B  and  functions  f.g  e C(0,T|  (DeVito.  1990,  p.  66). 

Recall  that  with  squared  L?  distance  as  our  dissimilarity  metric,  we  denote 
the  dissimilarities  between  the  true,  observed,  and  smoothed  curves  i and  j, 
respectively,  as  follows: 


(4.5) 
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<y  = l MO  - M <)]*  dt  = IMO  - /i,(OII’ 

«y  = l I«(‘)  - »(01a  dt  = ||iMO  - wlOII*  (4.6) 

4™”“'  = J 1A(0  - MOP*  = l|S»(0  - S»(t)|p.  (4.7) 

Recall  that  for  a set  of  points  t i„  in  (0,  Tj.  we  may  approximate  (4.5)- 


dy  = ^B'0 

di,  = -$'B 

4;™“’  = ^s'sse 

where  d,,  -*  itj  when  the  limit  is  taken  as  n ->  oo  with  t|, . . . ,f,  € (0,r). 

In  this  section,  the  smoothing  matrix  S corresponds  to  a discretized  analog  of 
S.  That  is,  it  maps  a discrete  set  of  points  y(t,), . . ,,y{tn)  on  the  “noisy"  observed 

curve  to  the  corresponding  points  ji(fi) £((„)  on  the  smoothed  curve  which 

results  from  the  application  of  S.  In  particular,  in  this  section  we  examine  how 
the  characteristics  of  our  self-adjoint  S correspond  to  those  of  our  symmetric  S as 
n — t oo  in  a fixed-domain  sense. 

Suppose  S is  self-adjoint.  Then  for  any  x,  y, 

(*-sv)  = fo  *)J  W.s)y{s)d,dt  = (Sx,y)  = f f K(t,s)x{s)dyy(t)dt. 

Therefore  for  points  t| („  6 (0,  T),  with  n ->  oo  in  a fixed-domain  sense, 

Ji,™  £ [I  x(t,Mt,,s)y(,)  ds  + • • - + j* x(tn)K{tn,  s)p(s)  dsj 

= [/  K{tus)x{s)y(tl)d*  + - + jf  K’(fmj)x(a)v(t„)d*j.  (4.8) 


Now. 
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for  a symmetric  matrix  S.  for  .any  x,y,  (x.Sy)  = (Sx.y)  « x Sy  - 

x'S'y-  At  the  sequence  of  points  t, I„,  x = (x(t,) x(t„))',  and  for  our 

smoothing  matrix  S. 

Sy  = (S|p](l, ) S[» 1W)'=  (jf  K(tl.s)y(s)ds....,j\(tn.s)y{s)ds^  . 

XSy  = l *)»(*) * + •••  + / x(t„)K(t„,s)y(s)ds 

= x'Sy  = K(t,,a)x(i)y[t,)ds+  - + J K(t„,s)x(s)y[t„)ds 


ITM[1  a(tl)K(h,a)y(a)da  - f-  J x(fn)/f(i„,s)i/(»)d*j 

= P7n)[jf  K{tl,s)X{s)y(tl)da+  - +J^  *)*(*)»(«.)  ds] 

So  taking  the  (fixed-domain)  limits  as  n -*  oo,  we  obtain  (4.8),  which  shows  that 
the  defining  characteristic  of  symmetric  S corresponds  to  that  of  self-adjoint  S. 
Definition  4.1  S is  called  a shrinking  smoother  l/||Sy||  < ||y||  far  all  elements  y 
(Baja  el  of.,  1989). 

A sufficient  condition  for  5 to  be  shrinking  is  that  alt  of  the  singular  values  of 
S be  < 1 (Buja  et  al.,  1989). 


Lemma  4.1  Let  A be  a symmetric  matrix  and  £ be  a symmetric,  positive  definite 


matrix.  Then 


x'ASAx 
S“P  x'Ex 


< [e™(A))J 


where  en„(A)  denotes  the  maximum 


Proof  of  Lemma  4.1: 


eigenvalue  of  A. 


Note  that 


x’A'SAx 


= 'W(E-|,!a'S1  3S1,jAE-|/3). 


Let  B = EI/3AE~I/3,  Then  this  maximum  eigenvalue  is  e„„(B'B)  < e^lA’A) 
[em„(A|]2  by  a propeny  of  the  spectral  norm.  C 

Lemma  4.2  Let  the  linear  smoother  S be  a self-adjoint  operator  with  singular 
values  € [0,1).  Then  tiar(  J||S0||3)  < uar(j||d||3).  where  S is  symmetric  with 
singular  values  € [0, 1|.  (Recall  0 = y,  - yj.) 

Proof  of  Lemma  4.2: 

We  will  show  that,  for  any  n and  for  any  positive  definite  covariance  matrix 
E of  0.  ear(j!|S8||3)  < uar(L||0||!)  where  S is  symmetric.  (A  symmetric  matrix 
is  the  discrete  form  of  a self-adjoint  operator  (see  Ramsay  and  Silverman,  1997. 
pp.  287-290).) 


and  so  we  apply  Lemma  4.1  with  S'S  playing  the  role  of  A.  Since  the  singular 
values  of  S are  e (0, 1],  all  the  eigenvalues  of  S'S  are  e [0, 1|.  Hence  we  have 


« var(||S0||3)  < rar(||0||3) 
s=>  var(0'S  SO)  < uar(O'O) 

» 2tr[(S'SE)3]  + 40'S  SES  S0  < 2tr[(E)3|  + 40'E0. 


40'SSESS0  < I0'E0  <= 


0'S'SSS'S0 


; show  tr[(S'S£)2|  < tr[(£)2]  for  any  £. 


ir|(E)2]  = f rj(SSE  + (I  — S’S)E)2| 

= tr((S'S£)2]  + lr(((I  - S S)E)2]  + (r[S’SE(I  - S S)E] 

+ tr|(I  - S S)ES  SEj 

= /r[(S'S£)2)  + (r[£l,2(I  - S'S)E(I  - S'S)EI,!| 

+ tr[S£(I  - S S)ESJ  + (r(SE(I  - S'S)ES'] 

= tr[(S'S£)2]  + lr[E‘'2(I  - S'S)E(I  - S'S)E1,!] 

+ 2(r(SE(I  - S’S)SS] 

= <r|(S'S£)2)  + fr(E,,2(I  - S'S)E,,3£''2(I  - S'SJE1'2) 

+ 2ir[SE(I  - S’S)‘'2(I  - S'S),,3ES! 

> (rj(S'SE)2] 

since  the  last  two  terms  are  > 0.  This  holds  since  tr[E1,2(I  - S'S)E1,2EI/2(I  - 
S'S)£I/2]  is  the  sum  of  the  squared  elements  of  the  symmetric  matrix  E1/2(I  - 
S'S)E1/2.  And  lr(SE(I  - S’S)1/!(I  - S’S)I,2ES1  is  the  sum  of  the  squared  elements 
of  (I  - S'S)1/2ES.  □ 

We  now  extend  the  result  in  Lemma  -1.2  to  functional  space. 

Corollary  4.1  Under  the  conditions  of  Lemma  4-2,  var[ii,\  < uar^f™"1*1]. 

Proof  of  Corollary  4.1: 

Note  that  since  almost  sure  convergence  implies  convergence  in  distribution, 

di,  = -9'6  4 i„ 

JpmoMA)  = Zffs'sS  4 

Consider  £f[|d,y|3]  = £((d„)3i  = g£[(0'0)3).  We  wish  to  show  that  this  is 
bounded  for  n > 1. 
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Now  86  ~ yJ(A).  where  A = 6 0 here.  The  third  moment  of  a noncentral 
chi-square  is  given  by 

n3  + 6n2  + 6n3A  + 8n  + 36nA  + 12nA3  + 48A  + 48A3  + 8A3. 

Hence  ^£[(8  8)3j  = 

T“[l  + 6/n  + 6A/n  + 8/n2  + 36A/n2  + 12AJ/na  + 48A/n3  + 48A  2/n3  + 8A3/n3] 

Now.  A/n  is  bounded  for  n > 1 since  '^6  6 is  the  discrete  approximation  of  the 
distance  between  signal  curves  pi (/)  and  p,(l),  which  tends  to  6ij.  Hence  the  third 
moment  is  bounded. 

This  boundedness,  along  with  the  convergence  in  distribution,  implies  (see 
Chow  and  Teicher.  1997.  p.  277)  that 

Jim  £[(4)‘)  = E[(6ij)k\  < oo 

for  k = 1,2. 

So  we  have 

Jtawor&l 

= lim£((d|j)2|  - lim(£(d,j)]3 
= lim£[(4)2]-[lim£(d1)))2 
= £[(i«)3l  - [B(4)]2 
= Bar[ij,-)  < oo. 


An  almost  identical  argument  yields 
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As  a consequence  of  Lemma  4.2,  since  iiar(djfm00*A,J  < uarjrf,;]  for  ail  n, 

< oarfo]-  □ 

The  following  theorem  establishes  the  domination  of  the  smooihed-data 
dissimilarity  estimator  over  t[1(,  observed-data  estimator  L,  when  the 

signal  curves  /it  ( f ) . i = 1 — . .V.  lie  in  the  linear  subspace  defined  by  S. 
Theorem  4.4  if  S is  a seif-adjoint  operator  and  a shrinking  smoother,  and  if 

jij(t)  is  in  the  linear  subspace  which  S projects  onto  for  alii  = 1 .V,  then 

AfSF(iJ'™°“’)  < MSEiS,,). 

Proof  of  Theorem  4-f: 

Define  the  difference  in  MSEs  A to  be  as  follows  (we  want  to  show  that  A 
negative): 

A = E{(6['m°°"'1  - 6,,)1)  - E{(6„  - *„)*} 

= s{(jf  WQ-Mt)]**-  J {p,lt)-pl(t)\1dtj  J 

- £■{  (f*M)  - vAt)]1  dt  - j\(t)  - n(t)f  d,)1} 

= e{(1  [%.(<)  - 5lf,(()]J dt)  + ( jfVw  - «(«))’ dt) 2 

~2{l  [Sw(l)  ■ (jfW)  - PAttfdt)  } 

- £{ (l  I*W  - »«]**)  + (jfwo  - PiWfdt)2 

- 2(/T|v*«)  - WWfdf)  (jf  M t)  - Mtffdt)  J 
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= £ j QT[S»(‘)  - S»(t)  1*  *)  ’ - (f\‘W  ~ »(‘)P *)  ’} 

+ 2^{  (fjyw  - ft(oprfi)  (jfr;rt(«)  - «(»)!’<#) 

- (jf [S»W  ~ %(*)]’*)  (jTwo  - *(»>?<#) }• 

Let  f = /y(t)  = y,{t)  - „,(t)  and  let  f = /„(()  = *(«)  - W(£).  Note  that  since  S is 
a linear  smoother.  SW(£)  - 5Wj(0  = 5f. 

+*{(£vr*)(j*w*)  - (/w*)(/v*)} 

= £{(Sf.  Sf)J  - (f.  f)!)  + 2£{(f,  f)(f.  f)  - (Sf,  Sf)(f.  f» 

= £{||Sf||4  - ilfll4}  + 2£{||f||a||f||l  - ::Sf||J||f|f} 

= £{||Sf|n  - £{||f||*}  - 2||f||J£{||Sf||5  - ||f||1} 

= mr(||Sf||>)  - imr(||f|p)  + {£(||Sf||a)}J  - {£(||f||a)}J 
- 2||f||a£{(||Sf||*  - Uf|P)} 

= «ar(||Sf||>)  - var(||f|p) 

+{£(||Sf|p)  + £(||f||J)}{£(||Sf||!)  - £(||f|P)} 
-2||f||a£{(||Sf||J-||f|p)}. 

Since  <i,((|  is  in  the  subspace  which  S projects  onto  for  all  i,  then 
Sf  = f=»||Sf||  = ||f||. 

A = unrlHSfll1)  - ™r(||f||5)  + £(||Sf|f  + ||f||3)£(||Sf||J  - ||f||*) 

-(l|Sf||’+l|f||,)£(l|Sf||,-||f||a). 
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Now,  since  »ar(||Sf||s)  < car(||f||3)  from  Corollary  4.1.  we  have: 

A < £(||Sf||3  + l|f||J)E(||S?||s  - ||f||J)  - (||Sf||>  + ||f||J)£(||Sf||’  - ||f||*) 

= BdlSfH1  - ||l||a)[£(||Sf||J  + ||f||»)  - (||Sf||J  + Ilfll2)].  (4.9) 

Since  S is  a shrinking  smoother,  ||Sf||  < ||f||  for  all  f.  Hence  ||Sf||J  < ||f||3  for 
all  i and  hence  £(||Sf||3)  < £(||f||J).  Therefore  the  first  factor  of  (4.9)  is  < 0. 

Also  note  that 

S(||Sf||*)  = b{  jTlSf]1*}  = £ E\si\Ut 
= f\ar{Sf\dt  + JT(ElS(\)’dt 
= £ var[St\dt  + £\St]*dt 
= f var(5f) t/(  + ||Sf||* > ||5f||3. 

Similarly, 

£(II?IIJ)  = fijjTV*}  =[e  ifp  di 
= f\arlHdl  + £(ElH)*dt 
= J var{f\dl  + J\f?dl 
= jf  irar(f)  dt  + ||f||a  > ||f||*. 

Hence  the  second  factor  of  (4.9)  is  positive,  implying  that  (4.9)  < 0.  Hence  the 
risk  difference  A < 0.  □ 

Remark.  We  know  that  ||Sf||  < ||f||  since  S is  shrinking.  If,  for  our  particular 
p,(«)  and  itj(t),  ||Sf||  < ||f||,  then  our  result  is  < MSE&j). 


CHAPTER  5 

A PREDICTIVE  LIKELIHOOD  OBJECTIVE  FUNCTION 
FOR  CLUSTERING  FUNCTIONAL  DATA 
Most  cluster  analysis  methods,  whether  deterministic  or  stochastic,  use  an 
objective  function  to  evaluate  the  “goodness"  of  any  clustering  of  the  data.  In  this 
chapter  we  propose  an  objective  function  specifically  intended  for  functional  data 
which  have  been  smoothed  with  a linear  smoother.  In  setting  up  the  situation, 
we  assume  we  have  -V  functional  data  objects  yi(f J, ....  yv(f).  We  assume  the 
discretized  data  we  observe  (yi. , . . . y.v ) follow  the  discrete  noise  model  (2.1). 

For  object  i.  define  the  indicator  vector  Zj  = (Z, i.  — ZiK)  such  that  Z,,  = 1 if 
object  i belongs  to  cluster  j and  0 otherwise.  Each  partition  in  C corresponds  to  a 
different  Z = (Z|, . . .. Zjv).  and  Z then  determines  the  clustering  structure.  (Note 
that  if  the  number  of  clusters  K < N,  then  Z^k+i  = ■■■  = ZM  = 0 for  all  i.) 

In  cluster  analysis,  we  wish  to  predict  the  values  of  the  Z,'s  based  on  the  data 

y (or  data  curves  (pi(f) y.v(l))-  Assume  the  Z,’s  are  i.i.d.  unit  multinomial 

vectors  with  unknown  probability  vector  p = (pi ps).  Also,  we  assume  a 

model  /(y|z)  for  the  distribution  of  the  data  given  the  partition.  The  joint  density 

/( y.*)  = 

can  serve  as  an  objective  function.  However,  the  parameters  in  /(y|z),  and  p, 
are  unknown.  In  model-based  clustering,  these  parameters  are  estimated  by 
maximizing  the  likelihood,  given  the  observed  y,  across  the  values  of  z.  If  the 
data  model  /(y|z)  allows  it,  a predictive  likelihood  for  Z can  be  constructed  by 
conditioning  on  the  sufficient  statistics  r(y,z)  of  the  parameters  in  /( y,s)  (Butler, 
1986;  Bjomstad,  1990).  This  predictive  likelihood,  /(y,*)//(r(y,z)),  can  serve  as 
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the  objective  function.  (Bjornstad  (1990]  notes  that,  when  y or  z is  continuous, 
an  alternative  predictive  likelihood  with  a Jacobian  factor  for  the  transformation 
r(y , z)  is  possible.  Including  the  Jacobian  factor  makes  the  predictive  iikelihood 
independent  of  the  choice  of  minimal  sufficient  statistic:  excluding  it  makes  the 
predictive  likelihood  invariant  to  scale  changes  of  z.) 

For  instance,  consider  a functional  datum  y,(t)  and  its  corresponding  observed 
data  vector  y,.  For  the  purpose  of  cluster  analysis,  let  us  assume  that  functional 
observations  in  the  same  cluster  come  from  the  same  linear  smoothing  model  with 
a normal  error  structure.  That  is.  the  distribution  of  Y,.  given  that  it  belongs  to 
cluster  j,  is 

Y,  I Zy  = 1 ~ ,V(Tft,ffJl) 

for  i = 1 -V.  Here  T is  the  design  matrix,  whose  n rows  contain  the  regression 

covariates  defined  by  the  functional  form  we  assume  for  y,(t).  This  model  for 
Y,  allows  for  a variety  of  parametric  and  nonparametric  linear  basis  smoothers, 
including  standard  linear  regression,  regression  splines,  Fourier  series  models, 
and  smoothing  splines,  as  long  as  the  basis  coefficients  are  estimated  using  least 
squares. 

For  example,  consider  the  yeast  gene  data  described  in  Section  7.1.  Since 
the  genes,  being  cell-cycle  regulated,  are  periodic  data.  Booth  et  al.  (2001)  use  a 
first-order  Fourier  series  y(()  = ft  + ft  cos(2irt/T)  + ftsin(2irt/r)  as  a smooth 
representation  of  a given  gene's  measurement,  fitting  a harmonic  regression  model 
(see  Brock  well  and  Davis.  1996,  p.  12)  by  least  squares.  Then  the  18  rows  of  T arc 

the  vectors  (1,  cas(&tj),$in(%tj)),  for  the  n = 18  time  points  tj,j  = 0 17.  In 

this  example  ft  = (fty,ft/,ftf)'. 

Let  J = [j\zi,  = 1 for  some  :}  be  the  set  of  the  K nonempty  clusters. 

The  unknown  parameters  in  /(y,z),j  e J,  are  (py.ft.oj).  The  corresponding 
sufficient  statistics  are  (m,,ft,  dj)  where  m,  = £i=i  z,y  is  the  number  of  objects 


:h  duster.  0j  is  the  p'-di 


and  aj.) 


on  m,  ji,  is  multivariate  normal  and  independent  of  Then 


Therefore  log/(y,s 


- tt «{>«»  ♦ '■« (5^ )" ■ - sjjO*- - W"  - *«} 

- - TOd'fa-TS,)} 

- 5?  [E  MVi  - T^)'(y,  - 18,)  + mj(T4,  - Tfl,)'(T^  - Tg,)]  J. 


Hence  /( y,  z) 

= n^fe)  x 

(yf  - T/§,)'(y,  - Tg,)  (Tg,  - Tg, )'(Tg,  - Tg,)  1 

**  2<rJ  m’  2a)  J 

-rn,n/j  -™,n  f (Tg,  - Tg,)'(Tg,  - Tg,)  (m,n  - p- 


j e*p{-i(g,  - g>)'|«»(/§J)]-I(4J  - g,)} 


m,\  (2t)p-/2!cou(/3j)|i/2 

oj cxp{-i('n^-p  a])}(mi"gj’p  ) 

exp  | -^(g,  - B,)‘(ri){8,  - g,)} 


Hence,  since  exp{-gK/j,  - g,)’(T-T)(g,  - 0,))  = exp{-^(Tg,  - 
Tg,)'(Tg,  - Tg,)),  we  have  that 


- n^.r-»(»)i-^r(^)(^)J 


CHAPTER  6 
SIMULATIONS 

6.1  Setup  of  Simulation  Study 

In  this  chapter,  we  examine  the  results  of  a simulation  study  implemented 
using  the  statistical  software  R.  In  each  simulation.  100  samples  of  N = 18  noisy 
functional  data  were  generated  such  that  the  data  followed  models  (2.1)  or  (2.2). 
For  the  discrete  noise  data,  the  errors  were  generated  as  independent  7V(0,  rr2) 
random  variables,  while  for  the  functional  noise  data,  the  emus  were  generated  as 
the  result  of  a stationary  Ornstein-Uhlenbcck  process  with  variability  parameter  tr 
and  “pull”  parameter  0 = 1. 

For  each  sample  of  curves,  several  (four)  "clusters"  were  built  into  the  data 
by  creating  each  functional  observation  from  one  of  four  distinct  signal  curves,  to 
which  the  random  noise  was  added.  Of  course,  within  the  program  the  curves  were 
represented  in  a discretized  form,  with  values  generated  at  n equally  spaced  points 
along  T = (0. 20j.  In  one  simulation  example,  n = 200  measurements  were  used  and 
in  a second  example,  n — 30. 

The  four  distinct  signal  curves  for  the  simulated  data  were  defined  as  follows: 

Hi(t)  = 0.5  ln(t  + 1)  + .01  cos(t),  i = 1 5. 

in(t)  = log,„(f  + 1)  - ,01cos(2f),i  = 6 10. 

Ui(t)  = 0.75  log5(i  + l)  + ,01sin(3<),  > = 11 15. 

#q(f)  = 0.3v'T+T-.01sin(«),t  = 16 18. 


68 


These  four  curves,  shown  in  Figure  6-1,  .were  intentionally  chosen  to  be  similar 
enough  to  provide  a good  test  for  the  clustering  methods  that  attempted  to 
group  the  curves  into  the  correct  clustering  structure,  yet  different  enough  that 
they  represented  four  clearly  distinct  processes.  They  all  contain  some  form  of 
periodicity,  which  is  more  prominent  in  some  curves  than  others.  In  short,  the 
curves  were  chosen  so  that,  when  random  noise  was  added  to  them,  they  would  be 
difficult  but  not  impossible  to  distinguish. 

After  the  "observed"  data  were  generated,  the  pairwise  dissimilarities  among 
the  curves  (measured  by  squared  £?  distance,  with  the  integral  approximated 
on  account  of  the  discretized  data)  were  calculated.  Denote  these  by  dp,  i = 

1 .V.  j = 1 rV.  The  data  were  then  smoothed  using  a linear  smoother  S 

(see  below  for  details)  and  the  James-Stein  shrinkage  adjustment.  The  pairwise 
dissimilarities  among  the  smoothed  curves  were  then  calculated.  Denote  these  by 

jpnuMA)'  ^ _ 1 V,  j = 1, . . . , Af,  The  sample  mean  squared  error  criterion  was 

used  to  judge  whether  dp  or  better  estimated  the  dissimilarities  among  the 

signal  curves,  denoted  dp,  i = 1 N,}=  1 N.  That  is, 

met*'  = ^ £ [(4 -^)J1 

was  compared  to 

= _L  £ 1(4'™““’- dp)2]. 

Once  the  dissimilarities  were  calculated,  we  used  the  resulting  dissimilarity 
matrix  to  cluster  the  observed  data,  and  then  to  cluster  the  smoothed  data.  The 
clustering  algorithm  used  was  the  K-medoids  method,  implemented  by  the  pom 
function  in  R.  We  examine  the  resulting  clustering  structure  of  both  the  observed 


Figure  6-1:  Plot  of  signal  curves  chosen  for  simulations. 
Solid  line:  /i, (()■  Dashed  line:  /io(0-  Dotted  line:  Dot-dashed  line: 


data  and  the  smoothed  data  to  determine  which  clustering  better  captures  the 
structure  of  the  underlying  clusters,  as  defined  by  the  four  signal  curves. 

The  outputs  of  the  cluster  analyses  were  judged  by  the  proportion  of  pairs  of 
objects  (i.e..  curves)  correctly  placed  in  the  same  cluster  (or  correctly  placed  in 
different  clusters,  as  the  case  may  be).  The  correct  clustering  structure,  as  defined 
by  the  signal  curves,  places  the  first  five  curves  in  one  cluster,  the  next  five  in 
another  cluster,  etc.  Note  that  this  is  a type  of  measure  of  concordance  between 
the  clustering  produced  by  the  analysis  and  the  true  clustering  structure. 

6.2  Smoothing  the  Data 

The  smoother  used  for  the  n = 200  example  corresponded  to  a cubic  B-spiine 
basis  with  16  interior  knots  interspersed  evenly  within  the  interval  [0. 20).  The  rank 
of  S was  thus  k = 20  (Ramsay  and  Silverman,  1997.  p.  49).  The  value  of  a in  the 
Jamcs-Stein  estimator  was  chosen  to  be  a = 160.  a choice  based  on  the  values  of 
n = 200  and  k = 20.  For  the  n = 30  example,  a cubic  B-spline  smoother  with  six 
knots  was  used  (hence  k = 10)  and  the  value  of  a was  chosen  to  be  o = 15. 

We  should  note  here  an  interesting  operational  issue  that  arises  when  using 
the  .lames-Stein  adjustment.  The  James-Stein  estimator  of  9 given  by  (3.4)  is 
obtained  by  smoothing  the  differences  9 = y,  — y; . i yt  j,  first,  and  then  adjusting 
the  S0‘s  with  the  James-Stein  method.  This  estimator  for  9 is  used  in  the  James- 
Stein  dissimilarity  estimator  For  a data  set  with  N curves,  this  amounts  to 
smoothing  (jVJ  - ;V)/2  pairwise  differences.  It  is  computationally  less  intensive  to 
simply  smooth  the  N curves  with  the  linear  smoother  S and  then  adjust  each  of 
the  N smooths  with  the  James-Stein  method.  This  leads  to  the  following  estimator 
of  9: 


| (yt  — Syi)  — Syj  — 
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Of  course,  when  simply  using  a linear  smoother  S,  it  does  not  matter  whether 
we  proceed  by  smoothing  the  curves  or  the  differences,  since  S0  = Sy<  — Sy,.  But 
when  using  the  James-Stein  adjustment,  the  overall  smooth  is  nonlinear  and  there 
is  an  difference  in  the  two  operations.  Since  in  certain  situations,  it  may  be  more 
sensible  to  smooth  the  curves  first  and  then  adjust  the  N smooths,  we  can  check 
empirically  to  see  whether  this  changes  the  risk  very  much. 

To  accomplish  this,  a simulation  study  was  done  in  which  the  coefficients  of 
the  above  signal  curves  were  randomly  selected  (within  a certain  range)  for  each 
simulation.  Denoting  the  coefficients  of  curve  by  (6,.i . 6,j),  the  coefficients 
were  randomly  chosen  in  the  following  intervals: 

6,.,  € [— 3.3].6|.s  6 [-0.5. 0.5).  V,  6 [-6,6],6c.s  6 [-0.5. 0.5), 

6,u  € [-4.5, 4.5).  6,13  6 [-0.5, 0.5),  6icj  e [-1.8,1.8],6,cj  € !-0.5,0.5|. 

For  each  of  100  simulations,  the  sample  MSE  for  the  smoothed  data  was  the  same, 
within  10  significant  decimal  places,  whether  the  curves  were  smoothed  first  or  the 
differences  were  smoothed  first.  This  indicates  that  the  choice  between  these  two 
procedures  has  a negligible  effect  on  the  risk  of  the  smoothed-data  dissimilarity 

Also,  the  simulation  analyses  in  this  chapter  were  carried  out  both  when 
smoothing  the  differences  first  and  when  smoothing  the  curves  first.  The  results 
were  extremely  similar,  so  we  present  the  results  obtained  when  smoothing  the 
curves  first  and  adjusting  the  smooths  with  the  James-Stein  method. 

6.3  Simulation  Results 

Results  for  the  n = 200  example  are  shown  in  Table  6-1  for  the  data  with 
independent  errors  and  Table  6-2  for  the  data  with  Ornstein-lfhlenbeck-type  errors. 
The  ratio  of  average  MSEs  is  the  average  MSE,°b‘)  across  the  100  simulations 
divided  by  the  average  MSE1'™"*1  across  the  100  simulations.  As  shown  by 


Table  6-1:  Clustering  the  observed  data  and  clustering  the  smoothed  data  (inde- 
pendent error  structure,  n = 200). 


Independent  error  structure 


io  of  avg.  MSEs  1 12.86 
prop,  (observed)  I .668 
prop,  (smoothed)  |j  .937 


30.10  62.66  i 70.89  72.10  72.24  71.03 


.428  ; .380  .372 


O-U  error  structure 


0.5  0.75  | 1.0  1.25  1.5 


Ratio  of  avg.  MSEs  3.53  39.09  | 126.53  171.39  176.52  | 168.13 
Avg.  prop,  (observed)  .997  .607 
Avg.  prop,  (smoothed) 


the  ratios  of  average  MSEs  being  greater  than  1.  smoothing  the  data  results  in 
an  improvement  in  estimating  the  dissimilarities,  and.  as  shown  graphically  in 
Figure  6 -2,  it  also  results  in  a greater  proportion  of  pairs  of  objects  correctly 
clustered  in  every  case.  Similar  results  arc  seen  for  the  n = 30  example  in  Table 
6-3  and  Table  6-4  and  in  Figure  6-3. 

The  improvement,  it  should  be  noted,  appears  to  be  most  pronounced  for 
the  medium  values  of  <r\  which  makes  sense.  If  there  is  small  variability  in  the 
data,  smoothing  yields  little  improvement  since  the  observed  data  is  not  very 
noisy  to  begin  with.  If  the  variability  is  quite  large,  the  signal  is  relatively  weak 
and  smoothing  may  not  capture  the  underlying  structure  precisely,  although  the 
smooths  still  perform  far  better  than  the  observed  data  in  estimating  the  true 
dissimilarities.  When  the  magnitude  of  the  noise  is  moderate,  the  advantage  of 
smoothing  in  capturing  the  underlying  clustering  structure  is  sizable. 


74 


shed  line:  Observed  da 
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The  above  analysis  assumes  the  number  of  clusters  is  correctly  specified  as 
4.  so  we  repeat  the  n = 200  simulation,  for  data  generated  from  the  O-U  model, 
when  the  number  of  clusters  is  misspecificd  as  3 or  o.  Figure  6-4  shows  that,  for 
the  most  part,  the  same  superiority,  measured  by  proportion  of  pairs  correctly 
clustered,  is  exhibited  by  the  method  of  smoothing  before  clustering. 

6.4  Additional  Simulation  Results 

In  the  previous  section,  the  clustering  of  the  observed  curves  was  compared 
with  the  clustering  of  the  smooths  adjusted  via  the  James-Stein  shrinking  method. 
To  examine  the  effect  of  the  James-Stein  adjustment  in  the  clustering  problem,  in 
this  section  we  present  results  of  a simulation  study  done  to  compare  the  clustering 
of  the  observed  curves,  the  James-Stein  smoothed  curves,  and  smoothed  curves 
having  no  James-Stein  shrinkage  adjustment. 

The  simulation  study  was  similar  to  the  previous  one.  We  had  20  sample 
curves  (with  n = 30)  in  each  simulation,  again  generated  from  four  distinct  signal 
curves:  in  this  case  the  signal  curves  could  be  written  as  Fourier  series.  Again,  for 
various  simulations,  both  independent  errors  and  Ornstein-Uhlenbeck  errors  (for 
varying  levels  of  a2)  were  added  to  the  signals  to  create  noisy  curves. 

For  some  of  the  simulations,  the  sigual  curves  were  first-order  Fourier  series: 

IM(t)  = cos(irt/10)  + 2sin(irt/10),  i = 1, . ...  5. 

«(t)  = 2 cos(ir!/10)  + sin(trf/10),  i = 6 10. 

Pi(t)  = 1.5cos(irf/10)  + sln(wt/10),t  = ll,...,15. 

Pi(t)  = - cos(xi/10)  + sin(irt/10),  i = 16 20. 

In  these  simulations,  the  smoother  chosen  was  a first-order  Fourier  series,  so  that 
the  smooths  lay  in  the  linear  subspace  that  the  Fourier  smoother  projected  onto.  In 
this  case,  we  would  expect  the  unadjusted  smoothing  method  to  perform  well,  since 


Figure  6-4:  Proportion  of  pairs  of  objects  correctly  matched,  plotted  against  a. 
Solid  line:  Smoothed  data.  Dashed  line:  Observed  data. 


the  first-order  smooths  should  capture  the  underlying  curves  extremely  well.  Here 
the  James-Stein  adjustment  may  be  superfluous,  if  not  detrimental. 

In  other  simulations,  the  signal  curves  were  second-order  Fourier  series: 

Pi(t)  = cos(iri/10)  + 2sin(irt/10)  + 5cos(2x(/10)  + sin(2irt/10),i  = 1,...,5. 
tt(t)  = 2cos(:rf/10)  +sin(ir!/10)  + cos(2irt/10)  + 8sin(2xl/10),i  = 6, . . . , 10. 

tn(t)  = 1.5cos(irt/10)  + sin(jrt/10)  + 5cos(2ir(/10)  + 8sin(2irt/10),  i = 11 15. 

p,(i)  = -cos(7rt/10)  + sin(jrt/10)  + 8cos(2irl/10)  + 5sin(2irt/10),i  = 16 20. 

In  these  simulations,  again  a first-order  Fourier  smoother  was  used,  so  that  the 
smoother  oversmoothed  the  observed  curves  and  the  smooths  were  not  in  the 
subspace  S projected  onto.  In  this  case,  the  simple  linear  smoother  would  not  be 
expected  to  capture  the  underlying  curves  well,  and  the  James-Stein  adjustment 
would  be  needed  to  reproduce  the  true  nature  of  the  curves. 

For  each  of  the  three  methods  (no  smoothing,  simple  linear  smoothing, 
and  James-Stein  adjusted  smoothing),  and  for  varying  magnitudes  of  noise,  the 
proportions  of  pairs  of  curves  correctly  grouped  are  plotted  in  Figure  6-5.  Again, 
for  each  setting,  100  simulations  were  run,  with  results  averaged  over  all  100  runs. 

The  James-Stein  adjusted  smoothing  method  performs  essentially  no  worse 
than  its  competitors  in  three  of  the  four  cases.  With  Ornstein-Uhlenbeck  errors, 
and  when  the  signal  curves  are  first-order  Fourier  series,  the  simple  linear  smoother 
does  better  as  a becomes  larger,  which  is  understandable  since  that  smoother 
captures  the  underlying  truth  perfectly.  For  higher  values  of  a (noisier  data),  the 
James-Stein  adjustment  is  here  shrinking  the  smoother  slightly  toward  the  noise 
in  the  observed  data.  Nevertheless,  the  James-Stein  smoothing  method  has  an 
advantage  in  the  independent-errors  case,  even  when  the  signal  curves  are  first- 
order.  In  both  of  these  cases,  both  types  of  smooths  are  clustered  correctly  more 
often  than  the  observed  curves. 


When  the  signal  curves  are  second-order  and  the  simple  linear  smoother 
oversmooths,  the  unadjusted  smooths  are  clustered  relatively  poorly.  Here,  the 
James-Stcin  method  is  needed  as  a safeguard,  to  shrink  the  smooth  toward  the 
observed  data.  The  James-Stein  adjusted  smooths  and  the  observed  curves  are 
both  clustered  correctly  more  often  than  the  unadjusted  smooths. 


CHAPTER  7 

ANALYSIS  OF  REAL  FUNCTIONAL  DATA 

7.1  Analysis  of  Expression  Ratios  of  Yeast  Genes 

As  an  example,  we  consider  yeast  gene  data  analyzed  in  Alter  et  al.  (2000). 
The  objects  in  this  situation  arc  78  genes,  and  the  measured  responses  are  (log- 
transformed)  expression  ratios  measured  18  times  (at  7-minute  intervals)  tj  = 7j 

for  j = 0 17  As  described  in  Spellman  et  al.  (1998).  the  yeast  was  analyzed 

in  an  experiment  involving  "alpha-factor-based  synchronization,”  after  which  the 
RNA  for  each  gene  was  measured  over  time.  (In  fact,  there  were  three  separate 
synchronization  methods  used,  but  here  we  focus  on  the  measurements  produced  by 
the  alpha-factor  synchronization.) 

Biologists  believe  that  these  genes  fall  into  live  clusters  according  to  the  cell 
cycle  phase  corresponding  to  each  gene. 

To  analyze  these  data,  we  treated  them  as  78  separate  functional  observations. 
Initially,  each  multivariate  observation  on  a gene  was  centered  by  subtracting  the 
mean  response  (across  the  18  measurements)  from  the  measurements.  Then  we 
clustered  the  genes  into  5 clusters  using  the  K-medoids  method,  implemented  by 
the  R function  pam. 

To  investigate  the  effect  of  smoothing  on  the  cluster  analysis,  first  we  simply 
clustered  the  unsmoothed  observed  data.  Then  we  smoothed  each  observation  and 
clustered  the  smooth  curves,  essentially  treating  the  fitted  values  of  the  smooths 
as  the  data.  A (cubic)  B-spline  smoother,  with  two  interior  knots  interspersed 
evenly  within  the  given  timepoints,  was  chosen.  The  rank  of  the  smoothing  matrix 
S was  in  this  case  * = 6 (Ramsay  and  Silverman.  1997,  p.  49).  The  smooths  were 
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adjusted  via  the  James-Stein  shrinkage  procedure  described  in  Section  3.2,  with 
a = 5 here,  a choice  based  on  the  values  of  n = 18  and  k = 6. 

Figure  7-1  shows  the  resulting  clusters  of  curves  for  the  cluster  analyses  of 
the  observed  data  and  of  the  smoothed  curves.  We  can  see  that  the  clusters  of 
smoothed  curves  tend  to  be  somewhat  less  variable  about  their  sample  mean  curves 
than  are  the  clusters  of  observed  curves,  which  could  aid  the  interpretability  of  the 
clusters. 

In  particular,  features  of  the  second,  third,  and  fourth  clusters  seem  more 
apparent  when  viewing  the  clusters  of  smooths  than  the  clusters  of  observed 
data.  The  major  characteristic  of  the  second  cluster  is  that  it  seems  to  contain 
those  processes  which  are  relatively  flat,  with  little  oscillation.  This  is  much  more 
evident  in  the  cluster  of  smooths.  The  third  and  fourth  clusters  look  similar  for 
the  observed  data,  but  a key  distinction  between  them  is  apparent  if  we  examine 
the  smooth  curves.  The  third  cluster  consists  primarily  of  curves  which  rise  to 
a substantial  peak,  decrease,  and  then  rise  to  a more  gradual  peak.  The  fourth 
cluster  consists  mostly  of  curves  which  rise  to  a small  peak,  decrease,  and  then  rise 
to  a higher  peak.  This  distinction  between  clusters  is  seen  much  more  clearly  in  the 
clustering  of  the  smooths. 
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Figure  7-1:  Plots  of  clusters  of  genes. 

Observed  curves  (left)  and  smoothed  curves  (right)  are  shown  by  dotted  lines. 
Mean  curves  for  each  cluster  shown  by  solid  lines. 


The  groupings  of  the  curves  into  the  live  clusters  are  given  in  Table  7-1  for 
both  the  observed  data  and  the  smoothed  data.  While  the  clusterings  were  fairly 
similar  for  the  two  methods,  there  were  some  differences  in  the  way  the  curves  were 
classified.  Figure  7-2  shows  an  edited  picture  of  the  respective  clusterings  of  the 
curves,  with  the  curves  deleted  which  were  classified  the  same  way  both  as  observed 
curves  and  smoothed  curves.  Note  that  the  fifth  cluster  was  the  same  for  the  set  of 
observed  curves  and  the  set  of  smoothed  curves. 

7.2  Analysis  of  Research  Libraries 

The  Association  of  Research  Libraries  has  collected  extensive  annual  data 
on  a large  number  of  university  libraries  throughout  many  years  (Association  of 
Research  Libraries.  2003).  Perhaps  the  most  important  (or  at  least  oft-quoted) 
variable  measured  in  this  ongoing  study  is  the  number  of  volumes  held  in  a library 
at  a given  time.  In  this  section  a cluster  analysis  is  performed  on  67  libraries  whose 
data  on  volumes  were  available  for  each  year  from  1967  to  2002  (a  string  of  36 
years).  This  fits  a strict  definition  of  functional  data,  since  the  number  of  volumes 
in  a library  is  changing  “nearly  continuously''  over  time,  and  each  measurement  is 
merely  a snapshot  of  the  response  at  a particular  time,  namely  when  the  inventory 
was  done  that  year. 

The  goal  of  the  analysis  is  to  determine  which  libraries  are  most  similar  in 
their  growth  patterns  over  the  last  3-plus  decades  and  to  see  how  many  groups  into 
which  the  67  libraries  naturally  clustered.  In  the  analysis,  the  logarithm  of  volumes 
held  was  used  as  the  response  variable,  since  reducing  the  immense  magnitude  of 
the  values  in  the  data  sot  appeared  to  produce  curves  with  stable  variances.  Since 
the  goal  was  to  cluster  the  libraries  based  on  their  "growth  curves"  rather  than 
simply  their  sizes,  the  data  was  centered  by  subtracting  each  library's  sample  mean 
(across  the  36  years)  from  each  year's  measurement,  resulting  in  67  mean-centered 
data  vectors. 


Figure  7-2:  Edited  plots  of  dusters  of  genes  which  were  classified  differently  as 
Observed  curves  (left)  and  smoothed  curves 


(right). 


The  observed  curves  were  each  smoothed  using  a basis  of  cubic  B-spiines  with 
three  knots  interspersed  evenly  within  the  timcpoints.  The  positive-part  James- 
Stein  shrinkage  was  applied,  with  a value  a = 35  in  the  shrinkage  factor  (based  on 
n = 36,  k = 7).  The  K-ntedoids  algorithm  was  applied  to  the  smoothed  curves, 
with  K = 4 clusters,  a choice  guided  by  the  average  silhouette  width  criterion  that 
Rousseeuw  (1987)  suggests  for  selecting  A’. 

The  resulting  clusters  are  shown  in  Table  7-2.  In  Figure  7-3.  we  see  the 
distinctions  among  the  curves  in  the  different  clusters.  Cluster  1 contains  a few 
libraries  whose  volume  size  was  relatively  small  at  the  begining  of  the  time  period, 
but  then  grew  dramatically  in  the  first  12  years  of  the  study,  growing  slowly  after 
that.  Cluster  2 curves  show  a similar  pattern,  except  that  the  initial  growth  is  less 
dramatic.  The  behavior  of  the  cluster  3 curves  is  the  most  eccentric  of  the  four 
clusters,  with  some  notable  nonmonotone  behavior  apparent  in  the  middle  years. 
The  growth  curves  for  the  libraries  in  cluster  4 is  consistently  slow,  steady,  and 
nearly  linear. 

For  purposes  of  comparing  the  clusters,  the  mean  curves  for  the  four  clusters 
are  shown  in  Figure  7-4. 

Note  that  several  of  the  smooths  may  not  appear  visually  smooth  in  Figure 
7-3.  A close  inspection  of  the  data  shows  some  anomalous  (probably  misrecorded) 
measurements  which  contribute  to  this  chaotic  behavior  in  the  smooth.  (For 
example,  see  the  data  and  associated  smooth  for  the  University  of  Arizona  library, 
in  Figure  7-5,  for  which  1971  is  a suspicious  response.)  While  an  extensive  data 
analysis  would  detect  and  fix  these  outliers,  the  data  set  is  suitable  to  illustrate 
the  cluster  analysis.  The  sharp-looking  peak  in  the  smooth  is  an  artifact  of  the 
discretized  plotting  mechanism,  as  the  cubic  spline  is  mathematically  assured  to 
have  two  continuous  derivatives  at  each  point. 


Figure  7-3:  Plots  of  clusters  of  libraries. 

Clusters  based  on  smoothed  curves.  Mean  curves  for  each  cluster  shown  by  solid 
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Figure  7-4:  Mean  curves  for  the  four  library  clusters  given  on  the  same  plot. 
Solid:  cluster  1.  Dashed:  duster  2.  Dotted:  cluster  3.  Dot-dash:  duster  4. 


Table  7-2:  A -1-cluster  K-medoids  clustering  of  the  smooths  for  the  library  data. 


Clusters  of  smooths  for  library  data 

! V.Y.-O  - / A:  ! Y . i y',,-  ;i;i 

Cluster  2;  Boston.  California-Los  Angeles,  Florida.  Florida  State,  Iowa. 

Iowa  State,  Kaosas.  Michigan  State,  Nebraska.  Northwestern.  Notre  Dame, 
Oklahoma.  Pennsylvania  State,  Pittsburgh,  Princeton.  Rochester, 

Southern  California.  SUNY-Buffalo,  Tempie,  Virginia.  Washington  U-St.  Louis, 

Wisconsin 

cc.-a  r v i'i  ' ' il ::  y:  iM  y-  ] A - r -- 1 i.  ■ ■ . ' Aui  , ! .m-innati,  Colorado.  Duke. 
Indiana,  Kentucky,  Louisiana  State,  Maryland.  McGill.  MIT.  Ohio  State, 

Oregon,  Pennsylvania.  Purdue,  Rutgers.  Stanford,  Tennessee.  Toronto.  Tulane, 

Utah,  Washington  State.  Wayne  State 

Cluster  i:  Columbia.  Cornell.  Harvard.  lllinois-Crliana.  Johns  Hopkins.  Michigan. 
Minnesota,  Missouri.  New  York,  North  Carolina.  Southern  Illinois.  Syracuse.  Yale  1 


Figure  7-5:  Measurements  and  B-spline  smooth.  University  of  Arizona  library. 
Points:  log  of  volumes  held,  1967-2002.  Curve:  B-spline  smooth. 


Table  7-3:  A 4-cluster  K-medoids  clustering  of  the  observed  library  data. 


Clusters  of  observed  library  data 

Cluster  1:  Arizona.  British  Columbia.  Connecticut.  Georgetown,  Georgia 
Cluster  3:  Boston.  C.tlifoMu  - . vo  . l.nvn. 

Iowa  State.  Kansas.  McGill.  Michigan  State.  Nebraska.  Northwestern, 

Notre  Oaine,  Oklahoma,  Pennsylvania  State.  Pittsburgh,  Princeton.  Rochester, 
Southern  California.  SUNY-Buffalo.  Temple,  Virginia,  Washington  U-St.  Louis, 

Wayne  State,  Wisconsin 

Cluster  3:  Brown,  Califomia-Berkeley,  Chicago,  Cincinnati,  Colorado.  Duke. 
Illinois-Urbana.  Indiana.  Kentucky,  Louisiana  State,  Maryland.  MIT, 

Ohio  State,  Oregon.  Pennsylvania,  Purdue.  Rutgers,  Stanford,  Syracuse, 
Tennessee,  Toronto,  Tulane.  Utah,  Washington  State 

■ .!  mils  Hopkins.  Michigan,  Minnesota, 

I Missouri.  New  York.  North  Carolina,  Southern  Illinois.  Yale 


Interpreting  the  meaning  of  the  clusters  is  fairly  subjective,  but  the  dramatic 
rises  in  library  volumes  for  the  cluster  1 and  cluster  2 curves  could  correspond 
to  an  increase  in  state-sponsored  academic  spending  in  the  1060s  and  1970s, 
especially  since  many  of  the  libraries  in  the  first  two  clusters  correspond  to  public 
universities.  On  the  other  hand,  cluster  4 contains  several  old,  traditional  Ivy 
League  universities  (Harvard,  Yale,  Columbia.  Cornell),  whose  libraries'  steady 
growth  could  reflect  the  extended  buildup  of  volumes  over  many  previous  years. 
Any  clear  conclusions,  however,  would  require  more  extensive  study  of  the  subject, 
as  the  cluster  analysis  is  only  an  exploratory  first  step. 

In  this  data  set.  only  minor  differences  in  the  clustering  partition  were  seen 
when  comparing  the  analysis  of  the  smoothed  data  with  that  of  the  observed 
data  (shown  in  Table  7-3).  Interestingly,  in  some  spots  where  the  two  partitions 
differed,  the  clustering  of  the  smooths  did  place  certain  libraries  from  the  same 
state  together  when  the  other  method  did  not.  For  instance,  Illinois-Urbana  was 
placed  with  Southern  Illinois,  and  Syracuse  was  placed  with  Columbia,  Cornell,  and 
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CHAPTER  8 

CONCLUSIONS  AND  FUTURE  RESEARCH 

This  dissertation  has  addressed  the  problems  of  clustering  and  estimating 
dissimilarities  for  functional  observations. 

We  began  by  reviewing  important  clustering  methods  and  connecting  the 
clustering  output  to  the  dissimilarities  among  the  objects.  We  then  described 
functional  data  and  provided  some  background  about  functional  data  analysis, 
which  has  become  increasingly  visible  in  the  past  decade.  An  examination  of  the 
recent  statistical  literature  reveals  a number  of  methods  proposed  for  clustering 
functional  data  (James  and  Sugar,  2003;  Tarpcy  and  Kinateder.  2003;  Abraham  et 
al..  2003).  These  methods  involve  some  type  of  smoothing  of  the  data,  and  we  have 
provided  justification  for  this  practice  of  smoothing  before  clustering. 

We  proposed  a model  for  the  functional  observations  and  hence  for  the 
dissimilarities  among  them.  When  the  data  are  smoothed  using  a basis  function 
method  (such  as  regression  splines,  for  example),  the  resulting  smoothed-data 
dissimilarity  estimator  dominates  the  estimator  based  on  the  observed  data  when 
the  pairwise  differences  between  the  signal  curves  lie  in  the  linear  subspace  defined 
by  the  smoothing  matrix. 

When  the  differences  do  not  lie  in  the  subspace,  we  have  shown  that  a James- 
Stein  shrinkage  dissimilarity  estimator  dominates  the  observed-data  estimator 
under  an  independent  error  model.  With  dependent  errors,  an  asymptotic  (for 
n large  within  a fixed  domain)  domination  result  was  given.  (While  the  result 
appears  to  hold  for  moderately  large  n,  the  asymptotic  situation  of  the  theorem 
corresponds  to  data  which  are  nearly  pure  functions,  measured  nearly  continuously 
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across  some  domain.)  The  shrinkage  estimator  is  a novel  way  to  unite  linear 
smoothers  and  Stein  estimation  to  derive  a useful  smoothing  method. 

A good  estimation  of  dissimilarities  is  only  a preliminary  step  in  the  eventual 
goal,  the  correct  clustering  of  the  data.  Thus  we  have  presented  a simulation  study 
which  indicates  that,  for  a set  of  noisy  functional  data  generated  from  a four-cluster 
structure,  the  data  which  are  smoothed  before  a K-medoids  cluster  analysis  are 
classified  more  correctly  than  the  observed,  unsmoothed  data.  The  correctness  of 
each  clustering  was  determined  by  a measure  of  concordance  between  the  clustering 
output  and  the  underlying  structure. 

Some  real  functional  data  were  analysed,  with  the  James-Stein  shrinkage 
smoother  applied  to  two  data  sets,  which  were  then  clustered  via  K-medoids.  One 
data  set  consisted  of  78  yeast  genes  whose  expression  level  was  measured  across 
time,  while  the  other  data  set  consisted  of  the  “growth  curves"  (number  of  volumes 
held  measured  over  time)  of  67  research  libraries. 

The  applications  for  this  research  are  wide-ranging,  since  functional  data 
are  gathered  in  many  fields  from  the  biological  to  the  social  sciences.  Besides  the 
applications  mentioned  in  this  dissertation,  functional  data  analyses  in  archaeol- 
ogy, economics,  and  biomechanics,  among  others,  are  presented  by  Ramsay  and 
Silverman  (2002). 

One  obvious  extension  of  this  work  that  could  be  addressed  in  the  future  is 
to  consider  linear  smoothers  for  which  S is  not  idempotent.  While  a symmetric, 
idempotent  S corresponds  to  an  important  class  of  smoothers,  many  other  common 
smoothers  such  as  kernel  smoothers,  local  polynomial  methods,  and  smoothing 
splines  have  smoothing  matrices  which  are  symmetric  but  not  in  general  iderapo- 

The  use  of  a basis  function  smoother  requires,  in  some  sense,  that  the  curves  in 
the  data  set  all  have  the  same  basic  structure,  since  the  same  set  of  basis  functions 


is  used  to  estimate  eacli  signal  curve.  Fortunately,  methods  such  as  regression 
splines,  especially  when  the  number  of  knots  is  fairly  large,  are  flexible  enough 
to  approximate  well  a variety  of  shapes  of  curves.  Also,  given  our  emphasis  on 
working  with  the  pairwise  differences  of  data  vectors  (i.e.,  discretized  curves),  for 
the  differences  to  make  sense,  we  need  each  curve  in  the  data  set  to  be  measured  at 
the  same  set  of  points  t, i„,  another  limitation. 

The  abstract  case  of  data  which  come  to  the  analyst  as  pure  continuous 
functions  was  resolved  in  this  thesis  in  the  case  for  which  the  signal  curves  were  in 
the  subspaco  defined  by  the  linear  operator  S,  but  the  more  general  case  still  needs 
further  work.  A more  abstract  functional  analytic  approach  to  functional  data 
analysis,  such  as  Bosq  (2000)  presents,  could  be  useful  in  this  problem. 

The  practical  distinction  that  arises  when  using  the  .lames-Stein  adjustment 
between  smoothing  the  pairwise  differences  in  the  noisy  curves  as  opposed  to 
smoothing  the  noisy  curves  themselves  was  addressed  in  Section  6.2.  Empirical  ev- 
idence was  given  indicating  that  it  mattered  very  little  in  practice  which  procedure 
was  done,  so  one  could  feel  safe  in  employing  the  computationally  faster  method  of 
smoothing  the  noisy  curves  directly.  More  analytical  exploration  of  this  is  needed, 
however. 

With  the  continuing  growth  in  computing  power  available  to  statisticians, 
computationally  intensive  methods  like  stochastic  cluster  analysis  will  become  more 
prominent  in  the  future.  Much  of  the  work  that  has  been  done  in  this  area  has 
addressed  the  clustering  of  traditional  multivariate  data,  so  stochastic  methods  for 
clustering  functional  data  are  needed.  Objective  functions  designed  for  functional 
data,  such  as  the  one  proposed  in  Chapter  5,  are  the  first  step  in  such  methods. 
The  next  step  is  to  create  better  and  faster  search  algorithms  to  optimize  such 
objective  functions  and  discover  the  best  clustering. 


In  a broad  sense,  i here  is  an  increasing  need  for  methods  of  analyzing  large 


functional  data  sets.  With  today's  advanced  monitoring  equipment,  scientists  can 
measure  responses  nearly  continuously,  over  time  or  some  other  domain,  for  a large 
number  of  individuals.  Statisticians  need  to  have  methods,  both  exploratory  and 
confirmatory*,  to  discover  genuine  trends  and  variability  across  great  masses  of 
curves.  The  field  of  data  mining  (Hand  et  al.,  2000)  has  begun  to  provide  some 
answers  for  these  problems,  and  functional  data  provide  another  rich  class  of 
problems  to  address. 


APPENDIX  A 

DERIVATIONS  AND  PROOFS 

A.l  Proof  of  Conditions  for  Smoothed-data  Estimator  Superiority 

Now.  since  S is  a shrinking  smoother,  this  means  ||Sy||  < ||y||  for  all  y,  and 
hence  ||Sy||a  < ||y||*  for  ally.  Therefore.  ||S9||3  < ||0||2.  Hence,  aud  because 
k < n.  we  see  that  the  first  term  of  (3.2)  is  less  than  the  first  term  of  (3.1),  and 
that  the  second  term  of  (3.2)  is  < 0.  The  third  term  of  (3.2)  will  be  positive, 
though. 

Therefore,  the  only  way  the  observed-data  estimator  would  have  a smaller 
MSE  than  the  smoothed-data  estimator  is  if  the  negative  quantity  ||Sfl||!  - ||0||2  is 
of  magnitude  large  enough  that  its  square  (which  appeals  in  the  third  term  of  (3.2) 
overrides  the  advantage  (3.2)  has  in  the  first  two  terms.  This  corresponds  to  the 
situation  when  the  shrinking  smoother  shrinks  too  much  past  the  underlying  mean 
curve.  This  phenomenon  is  the  same  one  seen  in  simulation  studies. 

We  can  make  this  more  mathematically  precise.  Note  that 


» + £]  + £|2*(||S0||»  - ||0||»)]  + 


^1(118011*  - ||9||J)3]  < 0 


(Let  w = ||S0|p  - ||fl||*.) 


.24  4s  2 


- 1|  + ^j(4  + 2k)w  + ^u>2  < 0 


Blip  + bw  + c<0. 


Here,  k < o,  so  a > 0.  b > 0,  and  e < 0.  Thus  the  left  hand  side  < 0 when 


Note  that  v/64  - lac  = j?,/(4  + 2ft)1  + 4(2n  + o1  - 24  - 4s).  So  the  left 
endpoint  of  the  interval  in  (A.l) 


Similarly,  the  right  endpoint  of  (A.l)  is  -2  - k + \/4  + 2o  + o2  + 24.  Hence, 
the  smoothed-data  estimator  is  better  in  terms  of  MSE  when 

-2-4-V4  + 2o  + os  + 24<  ||S0||3 - ||0||s  < -2-4  + V4  + 2o  + oa  + 24.  (A.2) 

Note  that  for  any  o > 1 and  integer  k < n,  the  right  hand  side  of  (A.2)  is 
greater  than  0.  Since  ||Sfl|p  - ||fl|p  < 0,  the  second  inequality  of  (A.2)  is  always 
satisfied  in  our  setting.  Hence,  the  smoothed-data  estimator  is  better  when: 


(A.l) 


-4-24- y/16  + 8n  + 4n2  + 84 


-4-24-2^4  + 20  + 02  + 24 


= -2  — 4-V4  + 2o  + o2  + 24. 


liseip  - non1 


1 > -2  - 4 - V4  + 2o  + oa  + 24.  Q 


Extension  of  Stochastic  Domination  Result 


Recall  that  we  have  a linear  combination  of  n independent  noncentral  Xi 
random  variables  and  a linear  combination  of  n independent  central  xj  variates. 

We  know  that  for  each  Sf  > 0,  i = 1, . . . . n, 

PlxM)  > *]  > Ptf  > *1  vx>o. 

Since  C|,  — c„  > 0.  letting  a,  — c,x,  i = 1 n. 

Pfexi5^?)  > «t]  > P|CtXj  > =,]  V*>0. 

(If  c,  = 0,  the  inequality  trivially  holds.)  Now.  since  f(x, r„)  = x,  + . . . +x„  is 

an  increasing  function,  and  since  the  n noncentra)  chi-squares  are  independent  and 
the  n central  chi-squares  are  independent,  we  apply  a result  given  in  Ross  (1996.  p. 
410.  Example  9.2(A))  to  conclude: 

P[<W(*1 ) + • • • + Cnx'i’fllj)  > l]  > P(ciXl  + • • • + CnXl  > xj  V X > 0.  □ 

A.3  Definition  and  Derivation  of  0,  and  <7? 

Recall  that  jq  is  the  response  vector  for  object  i,  and  we  denote  the  design 
matrix  by  T.  Then  0y  the  estimator  of  the  coefficients  of  the  regression  of  all  the 
objects  in  cluster  j,  is: 


= KTTl-'fry, + -.+rymj) 

= ^E<T"r>'‘T'y> 
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SSEj,  the  sura  of  squared  errors  of  the  regression  of  all  the  objects  in  cluster  j,  is: 
SSE,  = (y,' ynj') 

-bn' y»/ 

= yi'yi  + — ym,'ymj 

- (yi'T  + • • • + ym/T)[mJT'T|-|(T'y1  + • • • + Tymi) 

= yi'yi+  — y«.,’ym, 

- ^-(yi'T  + ■ ■ • + ymi’T)[(T'T)-,T'y1  + • • • + (rTJ-Ty.,] 

= ^-[(y.'yi  - y.TfrTl-ry,)  + • • ■ + (y^'y^  - ymjT(T'T)-,T,yn*)l 


(where  S is  the  m;ti  x mjti  block  matrix  with  I„  in  each  diagonal  block  and 
— T)  ~ lT'  in  each  off-diagonal  block) 

- (£§***)  »a=V, y-,w wn- 

Hence,  the  mean  squared  error  of  the  regression  of  all  the  objects  in  cluster  j, 


-£>?  + — I — (HiLi 
n>  iZl  min  ~ p‘  V mi 


»,')S(y  r ymi’)')J. 


APPENDIX  B 

ADDITIONAL  FORMULAS  AND  CONDITIONS 

The  four  roots  (two  real,  two  imaginary)  of  Ay  = 0 are  0 , pj/3  + (2/9)”  - 
(4/3)Pi/(-n  + ft).  and  the  double  imaginary  root  -(l/2)p$/3  - (1/9)”  - 
(4/3)Pt/(-n  + k)  + (1/2),V5(p5,S  - (2/9)”). 

Here, 

Pi  = n*-2nft-6n  + ft*  + 6ft  + 8. 

Pi  = (2/27)p,(— 3072ft  + 3072r  - I232n*  + 2464n*  - 1232ft*  + 60n3- 
132ft*  - on'  - 252n*ft  + 324nft*  + lln*ft  - 3n*t*  - 7 n*3  + 4ft*  - 2048)+ 

(-n  + ft)8  + (2/9K-442368*  + 344064n  - 1032192n*  - 147456n*  + 1179648**+ 
1250304n8  + 1274880*3  - 781056n®  - 1238016n*ft  - 1287168nft*  + 1658880n3*- 
238080n*ft*  - 1376256n*3  + 736512ft8  + 248064ft5  + 265536ns  - 870336*n®+ 
769728n*ft*  + 257472n*ft3  - 670464nft8  + 211680n5ft  - 327216n®ft*  + 133440n3ft5+ 
152496n*ft*  - 172320n*5  - 21636nsft  + 48288n5**  - 44712n8*3  + 588n3t'+ 
31308ftV  - 22776ft8n  + 576n'ft  - 1776n‘ft*  + 2520n3*3  - 1080n®ft‘  - 47088n5 
+4 9008ft®  + 3660n7  + 5280ft'  - 72n®  + 240ft5  - 1488n3ft5  + 2304n*ft®- 
1224n*'  - 15n*ft  + 27n'ft*  - 15ftV  + 27*V  - 15n®*3  - 15n3*5+ 

3f.*ft'  + 3„5) + (-«  + *))■/» + (-„  + *), 

Pi  = (768 ft  - 768n  + 344n*  - 688n*  + 344ft*  - 42n3  + 54ft3  - n*  + 138n*ft 
-150nft*  + n3ft  + 3n*ft*  - 5n*a  + 2ft®  + 512)  + ((-n  + ft)*p}/a). 
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B.2  Regularity  Condition  fo 


G in  Laplace 


Recall  the  Li 


Approximation 

dination  given  by  Liebennan  (1994)  for  the  expecta- 


tion of  a ratio  of  quadratic  forms: 


E 


E(x'Fx) 

E(x'Gx)' 


Lioberman  (1994)  denotes  the  joint  moment  generating  function  of  x'Fx  and  x'Gx 
by 


and  assumes  a positive  definite  G.  In  that  case.  E(x'Gx)  > 0.  Liebennan  uses  the 
positive  definiteness  of  G to  show  that  the  derivative  of  the  cumulant  generating 
function  of  x’Gx  is  greater  than  zero,  That  is. 


The  positive  derivative  ensures  the  maximum  of  !ogM(0,«2)  is  attained  at  the 
boundary  point  (where  ui2  = 0). 

For  positive  semidefinite  G.  we  need  the  additional  regularity  condition  that 
P{x  Gx  > 0)  > 0,  i.e.,  the  support  of  x’Gx  is  not  degenerate  at  zero.  This  will 
ensure  that  E(x'Gx)  > 0,  i.e.,  that  > 0.  Therefore  both  of  the 

integrals  in  the  above  expression  are  positive  and  the  Laplace  approximation  will 


Af(«i,uj)  = £[exp(wi x Fx  - wjx'Gx)]. 


log  iW(O.wj)  = ^-W(0,uj)  j,\f(0.iv2)J 


= y"  (x'Gx)  exp{i^ix'Gx)/(x)  dx  \^j  exp{wjx'Gx}/(x)dxJ 
> 0. 
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Cluster  analysis,  which  places  objects  into  reasonable  groups  based  on  statis- 
tical data  measured  on  them,  is  an  important  exploratory  tool  for  the  social  and 
biological  sciences.  We  explore  the  particular  problem  of  clustering  functional  data. 

data  include  growth  curves  and  biomechanics  data  measuring  movement.  In  recent 
years,  methods  for  smoothing  and  clustering  functional  data  have  appeared  in  the 
statistical  literature,  but  little  has  specifically  addressed  the  effect  of  smoothing  on 

We  examine  the  effect  of  smoothing  functional  data  on  estimating  the  dis- 
similarities among  objects  and  on  clustering  those  objects.  Through  theory  and 
simulations,  a shrinkage  smoothing  method  is  shown  to  result  in  a better  estimator 
of  the  dissimilarities  and  a more  accurate  grouping  than  using  unsmoothed  data. 
Two  examples,  involving  yeast  gene  expression  levels  and  research  library  "growth 
curves,"  illustrate  the  technique. 


